This article was automatically reconstructed from the verified source: www.nature.com.

The discovery of new materials has been fundamental to technological advancement. From the Bronze Age to the modern era, materials with novel properties have driven advancements that have shaped human civilization1. However, the methods for discovering and designing these materials have undergone a remarkable evolution, transitioning through several scientific paradigms: empirical science, model-based discovery, computational science, big data-driven science2, and now, generative AI (Fig. 1). This transformation has redefined how researchers explore and optimize the complex design spaces of materials, paving the way for faster and more efficient innovations. In its earliest stages, materials discovery was rooted in empirical science, where progress was achieved through observation, experimentation, and intuition1. Researchers relied on trial-and-error methodologies, often requiring decades to identify materials with desirable properties. The advent of theoretical science in the 20th century (Fig. 1) introduced model-based material discovery, where scientific principles such as thermodynamics and quantum mechanics provided a foundation for predicting material behavior3,4. Researchers began using equations and simplified models to guide experiments, reducing reliance on purely empirical methods2. This era saw significant advancements in understanding phase diagrams, crystallography, and other theoretical tools that enabled targeted material design. The rise of computational science in the latter half of the 20th century marked a turning point2. Advances in computing power allowed researchers to simulate material properties at atomic and molecular scales, enabling systematic exploration of complex systems. Computational tools like density functional theory (DFT) or molecular dynamics (MD) became standard, drastically accelerating discovery by testing hypotheses and refining designs virtually before experimental validation3,5,6,7,8. 
However, the scalability of computational methods was limited by the time and resources required for simulations, particularly for high-dimensional problems. The 21st century ushered in the era of big data-driven science, transforming materials discovery with machine learning (ML) and data analytics2. The availability of large experimental and computational datasets allowed researchers to train ML models to predict material properties, optimize synthesis routes, and identify patterns hidden within massive datasets9,10,11,12. Unlike earlier methods, data-driven approaches exploit correlations and relationships that are not explicitly captured in theoretical models, enabling rapid screening of vast chemical design spaces. For instance, ML has been used to identify promising candidates for renewable energy materials13,14, high-performance alloys15,16, and advanced composites17,18,19. This data-centric methodology significantly reduced the time and cost associated with traditional and computational approaches. The most recent breakthrough in materials discovery is the integration of generative AI15, which moves beyond prediction to actively design new materials. Leveraging generative adversarial networks (GANs) and variational autoencoders (VAEs)15,20, generative AI creates novel material compositions and structures that satisfy specific criteria, such as stability, performance, or sustainability. This represents a fundamental shift in materials discovery, where AI not only accelerates the process but also expands the possibilities for innovation. Alongside these advancements, open-source platforms have emerged as a transformative force in materials discovery. They provide a collaborative framework where researchers, industries, and governments can access, share, and develop tools, datasets, and methodologies21,22.
These platforms foster global collaboration, enabling a diverse community of stakeholders to contribute to materials discovery efforts without being hindered by proprietary barriers. For example, databases like the Materials Project23, Open Quantum Materials Database (OQMD)24, and NOMAD25 provide extensive repositories of material properties and simulation results, freely accessible to researchers worldwide. These resources serve as the foundation for training ML models and validating generative AI designs. The significance of open-source platforms extends beyond accessibility: they promote transparency, reproducibility, and standardization, ensuring that scientific discoveries can be independently verified and built upon26,27. Additionally, open-source platforms align with sustainability and ethical responsibility by addressing the environmental impact of energy-intensive workflows through the sharing of optimized algorithms and workflows that reduce computational overhead, thus promoting energy-efficient practices28,29. They also facilitate the development of community-driven guidelines for ethical data use, tackling issues such as bias, privacy, and intellectual property (IP) to ensure that AI-driven materials discovery remains aligned with societal values. The integration of open-source platforms with the frameworks of Industry 4.0 and Industry 5.0 promises a significant advancement in materials discovery (Fig. 2). Industry 4.0, defined by digitalization, automation, and interconnected systems, enables seamless incorporation of AI into experimental workflows30,31,32. Innovations like high-throughput robotics, IoT-enabled devices, cloud computing, and digital twins now support real-time monitoring, precision experimentation, and global collaboration33,34. These tools have transformed the traditional linear discovery pipeline into a dynamic, adaptive system capable of solving complex challenges.
In contrast, Industry 5.0 adopts a human-centric approach, emphasizing collaboration between advanced technologies, such as AI, and human creativity to deliver meaningful solutions32. This paradigm shift in materials discovery not only prioritizes speed and efficiency but also integrates ethical, environmental, and social objectives. For instance, Industry 5.0 emphasizes developing materials that are high-performing yet biodegradable, ethically sourced, and recyclable, underscoring the need to align innovation with sustainability and human values35. The convergence of these technological advancements brings both transformative opportunities and critical challenges. While AI and Industry 4.0 technologies revolutionize discovery processes, they also raise ethical and environmental concerns, such as biases in large datasets, IP issues, and data privacy risks36,37. Moreover, the computational demands of these systems and their final use can contribute to significant carbon emissions, creating a paradox where tools aimed at solving sustainability problems may inadvertently harm the environment38,39. Industry 5.0 addresses these issues by promoting energy-efficient workflows, transparent data practices, and responsible innovation, offering a pathway to harmonize technological progress with societal and ecological goals32,35. Taken together, these developments indicate that contemporary materials discovery is governed by the coordinated interaction of data-centric practices, physics-based simulation, and ML, rather than by any single methodological paradigm40,41. Each approach is grounded in distinct assumptions about how knowledge is generated, validated, and generalized. Making these foundations explicit is essential for understanding both the opportunities enabled by AI-driven materials discovery and the structural constraints that limit its applicability.
At a fundamental level, data-centric methods in materials science rest on the premise that data representation, curation, and uncertainty management, rather than model complexity alone, determine what relationships can be learned and trusted40,42,43. Typical data-processing workflows therefore follow a structured lifecycle that begins with data generation from experiments, simulations, literature, and increasingly synthetic sources41. These raw data undergo preprocessing steps such as cleaning, normalization, labeling, and harmonization to mitigate heterogeneity arising from disparate protocols and assumptions43,44. Feature construction and representation implicitly define the hypothesis space accessible to modeling, while uncertainty, bias, and data sparsity impose epistemic limits on downstream simulation and learning-based tasks42. First-principles and atomistic simulation methods occupy a complementary role by explicitly encoding physical laws, enabling causal interpretation, mechanistic insight, and theory-informed validation of material behavior45,46. Techniques such as DFT and MD provide controlled access to energetic landscapes, structural evolution, and kinetic pathways, often serving as reference standards for experimental design and model calibration45,47,48,49. At the same time, their predictive scope is intrinsically constrained by simplifying assumptions, including exchange–correlation approximations, force-field parameterizations, finite-size effects, and restricted time scales50,51,52,53. Combined with the exponential growth of configurational space, these factors limit the scalability of simulation-based exploration to complex compositional and processing regimes54,55,56. ML extends this framework by introducing scalable approximation and inference across high-dimensional design spaces, enabling rapid property prediction, surrogate modeling of expensive simulations, and generative exploration beyond enumerative search. 
Predictive models reduce computational and experimental cost, while generative and agentic approaches support inverse design and adaptive optimization. However, the generalization behavior of ML models remains tightly coupled to data quality, representativeness, and distributional bias, and does not inherently guarantee physical consistency or reliable extrapolation beyond the training domain57,58,59,60,61. Their effective use, therefore, depends on careful integration with curated data pipelines and physics-informed validation rather than standalone deployment. Viewed collectively, data-processing workflows, simulation methods, and ML form a coupled methodological system in which each paradigm compensates for the limitations of the others. Data infrastructures define the accessible information space, simulations anchor discovery in physical causality, and learning-based models provide the scalability required to navigate vast materials design spaces. An integrated, systems-level perspective provides the conceptual foundation for the present work, framing materials discovery as a systems-level problem in which data, models, and computational workflows must be co-designed rather than treated in isolation. By emphasizing the interdependence between methodological choices and infrastructural decisions, it highlights why reproducibility, scalability, and long-term sustainability cannot be addressed at the level of individual algorithms alone, but must instead be embedded within the overall research framework. This paper offers a comprehensive overview of the technological tools and methodologies available for researchers to design AI-powered, open-source infrastructures for materials discovery. The goal is to provide a clear guide through the key stages of the data lifecycle, spanning data collection, preprocessing, storage, organization, AI modeling, and deployment. 
By addressing each of these components, the paper equips researchers with the insights needed to create scalable, efficient, and sustainable platforms, ensuring the acceleration of innovation in materials science while aligning with the principles of openness and reproducibility. The discussion begins by examining traditional and modern data collection techniques, emphasizing the transition from manual experimental approaches to automated, digital systems, including in-silico simulations, synthetic data generation, and web scraping. It then explores preprocessing strategies to standardize and clean diverse datasets, followed by advanced storage solutions that leverage cloud and edge computing for scalability and real-time analysis. Data organization and indexation frameworks are highlighted for their critical role in enabling efficient retrieval and interoperability across various datasets. Subsequent sections focus on the application of cutting-edge AI modeling and data processing methodologies, emphasizing the integration of open-source frameworks that democratize access to computational tools and foster collaboration. The paper also delves into the deployment of these technologies in real-world scenarios, demonstrating their ability to address pressing challenges in sustainability, energy efficiency, and resource optimization. Additionally, the environmental impact of computationally intensive workflows is analyzed, with a particular focus on strategies to minimize the carbon footprint of AI-driven materials discovery. Together, these elements form a roadmap for researchers seeking to develop innovative, sustainable, and accessible platforms for advancing materials science. Figure 3 presents a structured overview of the manuscript, outlining the scope and organization of the proposed end-to-end AI infrastructure for materials discovery. 

Traditional data collection
Traditional data collection in materials discovery has largely relied on experimental methods, where scientists synthesize materials in the lab and observe their properties through trial-and-error approaches. This process involves testing different chemical compositions, processing conditions, and performance metrics like mechanical strength, thermal stability, and environmental impact. Often, these experiments are time-consuming, resource-intensive, and dependent on the expertise of the researchers conducting them62,63. Large datasets from these experiments are compiled over years of iterative trials, making material discovery slow and costly. Additionally, these experiments typically produce small-scale data with limited diversity, hindering the exploration of a wider range of material possibilities. Moreover, traditional data collection methods often suffer from a lack of standardization across different research groups, which can lead to inconsistencies and difficulties in comparing data64. Experimental conditions, such as temperature, pressure, or purity of the chemicals, can vary, affecting the reproducibility of results. Data from academic or industrial research may also remain siloed, limiting collaborative efforts and the sharing of knowledge that could accelerate sustainable materials discovery. Although traditional methods have led to breakthroughs, they lack the speed and scale needed to address modern challenges like climate change, resource scarcity, and environmental degradation in a timely manner. As a result, there is a growing push towards integrating computational tools and large-scale databases to complement traditional experimental methods for more efficient and scalable discovery.
Open-source datasets and databases play a crucial role in the research and discovery of new materials.
In the fields of chemistry and materials science, these databases provide key information on the chemical, biological, and structural properties of a wide range of compounds, granting access to experimental data that might otherwise be difficult to obtain. These databases enable researchers to perform complex analyses, run simulations, and conduct virtual screenings of compounds for the development of new materials, pharmaceuticals, and other applications. There are various open and specialized databases widely used in materials research (Table 1). PubChem65 is a comprehensive database that offers detailed information on the chemical properties and biological activities of small molecules, covering over 113 million compounds. Similarly, ChEMBL66 provides data on bioactive molecules with drug-like properties, while the Crystallography Open Database (COD)67 offers access to crystal structure data for organic, inorganic, and metal-organic compounds. Each database has a specific focus, whether on organic molecules, crystal structures, or inorganic compounds, and their open access fosters significant collaboration and progress in the field of sustainable materials.

Digital and automated data collection

Smart systems
Self-driving laboratories (SDLs) are revolutionizing scientific research by integrating automation, ML, and robotics to accelerate discoveries in fields such as chemistry and materials science68. These cutting-edge systems automate the entire experimental process, enabling the creation of a closed-loop framework where experiments are designed, executed, and optimized without human intervention69,70. This transformative capability allows researchers to navigate vast chemical and material landscapes with unprecedented speed and precision, achieving breakthroughs that would have required significantly more time and effort using traditional laboratory approaches68. As depicted in Fig. 4, SDLs operate within an iterative closed-loop system driven by user-defined goals, predictive modeling through AI and ML, advanced experimental designs, automated synthesis, real-time characterization, and digital feedback for optimization. This workflow not only accelerates discovery cycles but also enhances resource efficiency, sustainability, data integrity, and reproducibility. Moreover, SDLs facilitate global collaboration and knowledge sharing, which are critical for advancing innovation in materials science. The integration of digital twins, virtual representations of physical laboratory equipment, has further extended the capabilities of SDLs. These digital models enable remote monitoring and adjustment of experiments, fostering seamless collaboration across geographical boundaries. For example, SDLs in laboratories located in different countries, such as Cambridge and Singapore, can collaborate to optimize chemical reactions in real time34. In addition, adherence to the FAIR guiding principles (Findable, Accessible, Interoperable, and Reusable) within SDL frameworks not only enhances transparency but also supports effective teamwork, both of which are essential for contemporary research practices44. SDL technology has also benefited from the falling cost of automation components, such as robotics and 3D printing, which has made these systems more accessible. Remote control capabilities allow laboratories to operate SDLs over the internet, eliminating the need for substantial infrastructure investments. This democratization of technology accelerates innovation (Fig. 4b) and extends the transformative potential of SDLs to a broader range of research institutions and disciplines71. Advancements in digital and automated data collection have also highlighted the critical role of open-source software in laboratory automation.
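The closed-loop workflow described above (goal, planning, automated execution, characterization, feedback) can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: `run_experiment` is a hypothetical simulated measurement with an invented optimum at 80 °C, and `propose_next` is a simple random local search standing in for the ML-driven planner that an actual SDL would typically use. It is not taken from any of the cited systems.

```python
import random


def run_experiment(temperature_c, rng):
    """Hypothetical stand-in for automated synthesis plus characterization.

    Returns a noisy 'yield' whose noise-free value peaks at 80 C
    (an invented optimum used only for this illustration).
    """
    true_yield = 100.0 - 0.05 * (temperature_c - 80.0) ** 2
    return true_yield + rng.gauss(0.0, 0.5)


def propose_next(history, rng, low=20.0, high=150.0, step=10.0):
    """Toy planner: perturb the best condition observed so far.

    This random local search is a placeholder for a surrogate-model-based
    planner; it only shows where the planning step sits in the loop.
    """
    if not history:
        return rng.uniform(low, high)
    best_temperature, _ = max(history, key=lambda record: record[1])
    candidate = best_temperature + rng.uniform(-step, step)
    return min(high, max(low, candidate))  # keep within the allowed range


def closed_loop(n_iterations=40, seed=0):
    """Design -> execute -> measure -> feed back, repeated n_iterations times."""
    rng = random.Random(seed)
    history = []  # list of (temperature, measured_yield) records
    for _ in range(n_iterations):
        temperature = propose_next(history, rng)
        measured = run_experiment(temperature, rng)
        history.append((temperature, measured))
    return history


if __name__ == "__main__":
    records = closed_loop()
    best_temperature, best_yield = max(records, key=lambda record: record[1])
    print(f"best temperature: {best_temperature:.1f} C, yield: {best_yield:.1f}")
```

The loop keeps every record rather than only the best one, since the accumulated history is what a model-based planner would be trained on in a less simplified setting.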
Tools such as Chemspyd, an open-source Python-based interface for Chemspeed Technologies’ robotic platforms, exemplify this trend. Chemspyd offers dynamic control over automated experiments, integrating seamlessly into existing workflows to enhance flexibility in experimental design. This innovation fosters modular laboratory environments, increases the speed and reproducibility of experiments, and enables continuous, 24/7 laboratory operations72. SDLs are playing a pivotal role in the development of clean energy technologies by enabling rapid testing of materials such as advanced photovoltaics and thermoelectrics13,73. Researchers can explore diverse material combinations at an accelerated pace, identifying energy-efficient and cost-effective solutions far faster than traditional methods73,74. The discovery of sustainable materials through SDLs supports global efforts to transition to low-carbon technologies, contributing to long-term sustainability and reduced environmental impact75. A notable example of this transformative capability is the Ada SDL76, which leverages automated synthesis and digital feedback loops like those depicted in Fig. 4 to optimize clean energy materials like thin-film solar cells. Ada significantly reduces the need for resource-intensive experimentation, enhancing the performance of energy-harvesting technologies while conserving resources76. By facilitating faster innovation in clean energy, SDLs like Ada are propelling the transition to a more sustainable and energy-efficient future. In parallel with the development of SDLs, the field of intelligent manufacturing has emphasized the use of digital twins and ML to enable predictive, adaptive, and resource-efficient systems. Digital twins are widely employed to represent physical processes and assets, supporting monitoring, optimization, and decision-making under real-world operating constraints. 
However, when digital twins rely exclusively on data-driven learning models, their predictive reliability may degrade under sparse data conditions, extrapolative regimes, or dynamically changing environments. In such cases, purely statistical models can yield physically inconsistent predictions, including violations of conservation laws or thermodynamic constraints, limiting their robustness and long-term applicability77,78,79. Physics-informed ML (PIML), including physics-informed neural networks (PINNs), has emerged as a principled approach to address these limitations by embedding governing physical laws directly into the learning process77,78,80,81. Through constrained architectures, physics-aware feature engineering, or regularized loss functions, PIML enables hybrid modeling paradigms in which experimental data, numerical simulations (e.g., MD or finite-element methods), and first-principles equations jointly inform model training. These approaches have demonstrated improved data efficiency, interpretability, and robustness across materials-relevant domains, including thermal transport, chemical reaction systems, additive processes, and structure–property modeling under limited or noisy data availability78,79,82,83,84. By enforcing thermodynamic, mechanical, or transport constraints, PIML-based models reduce nonphysical predictions and improve predictive stability85,86,87. These developments are directly relevant to SDLs, where digital models increasingly guide experiment selection, parameter optimization, and closed-loop decision-making. By incorporating physics-informed learning into SDL digital twins, autonomous experimental workflows can achieve greater stability, interpretability, and resilience while reducing failed experiments and unnecessary resource consumption.
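As a minimal illustration of the "regularized loss function" route mentioned above, the sketch below adds a soft physics penalty to an ordinary data-fit term. The constraint chosen here (predicted phase fractions in each sample should sum to one) and all names and numbers are assumptions for illustration only; real PIML models usually penalize residuals of governing differential equations or thermodynamic relations instead.

```python
def data_loss(predicted, observed):
    """Mean squared error between predicted and measured phase fractions."""
    squared_errors = [
        (p - o) ** 2
        for p_row, o_row in zip(predicted, observed)
        for p, o in zip(p_row, o_row)
    ]
    return sum(squared_errors) / len(squared_errors)


def physics_penalty(predicted):
    """Mean squared violation of the assumed mass-balance constraint.

    Each row holds the predicted phase fractions of one sample,
    which should sum to 1 if the prediction is physically consistent.
    """
    violations = [(sum(row) - 1.0) ** 2 for row in predicted]
    return sum(violations) / len(violations)


def physics_informed_loss(predicted, observed, weight=10.0):
    """Data-fit term plus a weighted soft physics constraint."""
    return data_loss(predicted, observed) + weight * physics_penalty(predicted)


# Invented example values: two samples, two phases each.
observed = [[0.70, 0.30], [0.40, 0.60]]
model_a = [[0.72, 0.28], [0.38, 0.62]]  # fractions sum to 1 in every row
model_b = [[0.72, 0.32], [0.42, 0.62]]  # same data error, but rows sum to 1.04

if __name__ == "__main__":
    for name, model in (("model_a", model_a), ("model_b", model_b)):
        print(name, data_loss(model, observed), physics_informed_loss(model, observed))
```

Both hypothetical models fit the measurements equally well, so a purely data-driven loss cannot tell them apart; only the penalty term separates the physically consistent prediction from the inconsistent one. The `weight` value is arbitrary and would need tuning in practice.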
In this sense, SDLs provide a controlled laboratory-scale environment in which concepts from intelligent manufacturing, such as physics-informed digital twins and sustainability-oriented optimization, can be developed, validated, and refined before translation to larger-scale systems. This positioning allows SDLs to benefit from advances in intelligent manufacturing and physics-informed modeling while maintaining their primary focus on materials discovery and experimental research, and aligns naturally with emerging priorities related to sustainable, reliable, and human-centered autonomous systems.

Synthetic data
High-fidelity computational modeling, or in silico simulation, has emerged as a cornerstone for data generation in materials discovery workflows. By providing atomic- to mesoscale-level insights into material properties and behaviors, these simulations complement experimental efforts and greatly expand the accessible design space. Techniques such as MD, first-principles calculations based on DFT, and advanced quantum chemistry methods enable researchers to virtually screen candidate materials, predict phase stability and electronic structure, and characterize interfacial phenomena. The resulting datasets serve as critical inputs for AI models, informing materials property prediction, design optimization, and guiding subsequent experimental validation55,88,89,90. A robust ecosystem of computational tools underpins these efforts. Widely adopted MD packages such as LAMMPS91, GROMACS92, and NAMD93 provide scalable, community-driven platforms for simulating molecular interactions in complex systems. Meanwhile, first-principles frameworks like Quantum ESPRESSO94, ABINIT94, and VASP95 offer quantum-mechanical precision for evaluating electronic structures, and quantum chemistry packages like ORCA72 and Gaussian82 support high-accuracy electronic structure calculations.
These software tools, along with others, are summarized in Table 2, which categorizes them by simulation type and accessibility, aiding researchers in identifying appropriate tools for their projects. Leveraging these computational tools, researchers have tackled a range of sustainability-relevant challenges. For example, MD simulations have been employed to study the degradation mechanisms of battery electrolytes, the properties of perovskites as electron transport layers in solar cells, green surfactants for enhanced oil recovery, and superhydrophobic cellulose derivatives for sustainable packaging96,97,98,99. They also support improvements in material processing, such as optimizing energy-intensive processes like drying in the forest products industry by studying interactions at the molecular level56. Additionally, first-principles simulations have been used to investigate the solvation properties of lignin fragments for renewable fuel production, novel materials for renewable energy applications, hybrid nanotubes for next-generation optoelectronics, and hydrogen storage materials100,101,102,103,104. These sustainability-driven applications exemplify the versatility and impact of in silico methods. Alongside these simulation tools, open-access materials databases, described in Table 3, compile extensive thermodynamic, structural, and electronic information gleaned from both experimental results and DFT calculations. Resources like the Materials Project23, the OQMD24, and Aflowlib105 provide standardized reference datasets, enabling researchers to benchmark their simulations, optimize materials more efficiently, and train AI models for property prediction. In parallel with direct ab initio simulations, ML has increasingly been integrated into first-principles and atomistic workflows as a means of accelerating data generation while retaining a physics-based foundation.
In this approach, machine-learning models are trained on reference electronic-structure calculations, most commonly DFT, to learn potential energy surfaces and interatomic forces, enabling MD simulations at length and time scales inaccessible to direct quantum-mechanical methods. This paradigm has been comprehensively reviewed in the context of both materials science and quantum chemistry, highlighting its ability to extend first-principles accuracy to large-scale simulations while maintaining explicit links to the underlying physical theory106,107,108. Representative frameworks include high-dimensional neural network potentials109, Gaussian approximation potentials110, and more recent symmetry-equivariant graph neural networks that enforce rotational and translational invariances in atomic environments111. These studies consistently emphasize that the reliability of ML-accelerated simulations remains fundamentally constrained by the diversity and coverage of the training data. Predictive performance degrades under extrapolative conditions, and physically inconsistent behavior may emerge outside the sampled chemical or configurational domain106,107,108. Consequently, ML-enhanced atomistic simulations are best understood as enabling infrastructure for simulation-driven materials discovery rather than autonomous discovery algorithms, requiring ongoing validation, uncertainty-aware sampling strategies, and periodic retraining against high-fidelity quantum-mechanical reference calculations. ML-generated synthetic data has emerged as a vital strategy to address challenges such as small or biased datasets in materials science. Unlike in silico simulations, which model real-world physical processes based on theoretical principles, generated data mimics statistical patterns in experimental datasets. ML techniques, such as VAEs, GANs, and data augmentation, play a central role in generating synthetic datasets. 
These approaches create novel molecular representations and expand the diversity of existing datasets, enhancing predictive capabilities112,113,114,115. Beyond accelerating discovery, in silico simulations and generative AI can reduce the environmental and logistical burdens associated with experimental work by minimizing the need for extensive physical testing. However, this shift to computational methods does not guarantee sustainability. Large-scale, high-fidelity simulations demand substantial computational resources, incurring notable energy consumption and carbon footprints. As simulations grow in complexity, the associated energy costs can become significant.

Data scraping from publicly available sources
Data scraping is a method used to automatically extract large volumes of data from publicly available sources such as websites, allowing researchers to gather information that is not easily accessible through conventional means115. It involves parsing HTML content, interacting with dynamic elements, and collecting structured or unstructured data. In materials science, where data is essential for understanding material properties, synthesis methods, and experimental outcomes, web scraping holds significant potential. By automating the extraction of relevant data from scientific journals, specialized databases, and research repositories, web scraping can facilitate the compilation of large datasets. This is crucial for data-driven research and ML applications, where the quality and volume of data directly impact the accuracy of predictive models. The application of web scraping techniques (Table 4) in materials science could transform the way researchers access and analyze scientific data. Extracting experimental details, property measurements, and synthesis conditions from journal articles or database entries can feed into comprehensive datasets, helping to identify trends, predict material behaviors, and inform the design of novel compounds.
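The HTML-parsing step of web scraping can be sketched with Python's standard-library `html.parser` module, used here only so that the example runs without third-party packages; dedicated scraping libraries offer more convenient interfaces. The page content in `SAMPLE_HTML`, the `TableExtractor` class, and the `extract_records` helper are all hypothetical and stand in for a static database page that lists property measurements in a table.

```python
from html.parser import HTMLParser

# Invented static page fragment used in place of a real website.
SAMPLE_HTML = """
<table id="properties">
  <tr><th>Material</th><th>Band gap (eV)</th></tr>
  <tr><td>Si</td><td>1.12</td></tr>
  <tr><td>GaAs</td><td>1.42</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag in ("td", "th") and self._current_row is not None:
            self._in_cell = True
            self._cell_text = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self._current_row.append("".join(self._cell_text).strip())
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            self.rows.append(self._current_row)
            self._current_row = None


def extract_records(html):
    """Turn the first row into field names and the rest into dict records."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    for record in extract_records(SAMPLE_HTML):
        print(record)
```

Values are kept as strings on purpose: unit handling and numeric conversion belong to the later preprocessing stage, where inconsistent formats across sources can be harmonized in one place.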
Tools like Beautiful Soup116 and Scrapy116 can parse structured HTML data from simpler websites, while Selenium117 and Puppeteer118 are valuable for scraping dynamically loaded content from JavaScript-heavy pages. Visual tools such as Octoparse119 and ParseHub120 offer a more user-friendly approach, making them accessible for quick extractions, though they might lack the flexibility required for complex tasks. For enterprise-level needs, advanced solutions like Diffbot121 and Content Grabber122 provide automated, AI-driven data extraction capabilities, which could be leveraged for large-scale data collection in materials science applications.

Automated data extraction from scientific literature
Advances in automated data extraction and ML are reshaping materials science, speeding up discovery processes to a degree previously thought impossible. Using AI, particularly large language models (LLMs) like BERT, GPT-3, and GPT-4, researchers have built powerful tools to compile extensive databases, significantly reducing the time, cost, and manual effort traditionally involved in data collection. These AI-driven methods have opened transformative opportunities for innovation, enabling a shift towards data-centric approaches in materials science123. A key development in this field is the use of transformer-based LLMs to extract valuable insights from scientific literature. Tools like BatteryDataExtractor and ChatExtract124,125 utilize these advanced models to process unstructured text, extracting meaningful material-property relationships with impressive accuracy. The NEMAD project126 exemplifies this synergy, combining LLM-based data parsing with predictive ML models to demonstrate the effectiveness of hybrid approaches. Additionally, recent work on large-scale polymer data extraction has shown that domain-specific models, such as MaterialsBERT127, outperform generic LLMs in recognizing materials science terminology.
MaterialsBERT, trained on 2.4 million materials science abstracts, has enabled the extraction of over 300,000 material property records from approximately 130,000 polymer-related abstracts within 60 hours127. This large-scale automated extraction significantly accelerates knowledge retrieval compared to manual curation, providing structured material property datasets crucial for developing predictive models. Recent advancements have further expanded the scale of AI-driven data extraction in materials science. A newly developed LLM-based framework successfully extracted over one million polymer-property records from 681,000 full-text journal articles, encompassing 24 material properties and over 106,000 unique polymers128. This study employed both commercially available (GPT-3.5) and open-source (Llama 2) models in conjunction with the named entity recognition (NER)-based MaterialsBERT model. The extracted dataset, made publicly available through the Polymer Scholar website, represents the most extensive polymer-property dataset compiled to date. The study also provided key insights into the computational efficiency, accuracy, and cost-effectiveness of different AI models, demonstrating that GPT-3.5 extracted significantly more records than previous methods while maintaining high precision128. Another notable example is the use of AI-driven methods to extract and analyze data on polymer solar cells. A recent study employed a natural language processing (NLP) pipeline to extract donor-acceptor combinations and their corresponding power conversion efficiencies from over 3300 research papers129. This dataset, nearly five times larger than previous manually curated collections, enabled the training of ML models to predict the efficiency of new donor-acceptor pairs. By leveraging active learning strategies, these models demonstrated a potential 75% reduction in material discovery time, equivalent to accelerating innovation by 15 years129.
This showcases the transformative potential of AI in accelerating materials discovery, particularly in emerging fields like organic photovoltaics. Similarly, a dataset comprising 3943 molecular mutations for metal-organic framework (MOF) linkers has been created130. In this effort, GPT-assisted tools proposed and edited chemical structures, while human reviewers validated outputs to ensure reliability. A fully automated pipeline was also developed to extract synthesis procedures from scientific literature, transforming textual descriptions into flow graphs through deep learning and rule-based systems131. Extending automation further, GPT-4V’s vision capabilities have been employed to analyze images from 346 articles, automating the classification and labeling of scientific images with minimal human intervention132. Meanwhile, the MaterialsBERT pipeline has demonstrated its ability to recover non-trivial insights across various material applications, such as fuel cells, supercapacitors, and polymer solar cells, through large-scale literature mining127,133. These efforts illustrate how AI-driven approaches are transforming dataset creation, from partially supervised methods to fully autonomous systems. Diverse techniques have been implemented for automating data extraction. Traditional NLP methods, such as entity recognition used in Snowball 2.0134, are being combined with advanced strategies like prompt engineering in projects such as ChatExtract and NEMAD. Another innovative approach involves uncertainty quantification, for example through Shannon entropy, as demonstrated in the Uncertainty-Informed Screening study, which assesses the reliability of AI predictions135. Despite these successes, challenges persist in scaling up models to handle larger datasets and extracting information from complex formats like tables and figures. Moreover, many tools depend on specialized datasets, which can restrict their applicability across diverse subfields within materials science.
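To make the entropy-based uncertainty idea concrete, the following is a minimal sketch that assumes each extracted record comes with a discrete probability distribution over candidate labels. It is not the procedure of the cited study; the function names, the threshold of 1.2 bits, and the probabilities are all hypothetical and chosen only for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def flag_uncertain(predictions, threshold_bits):
    """Return the ids of records whose predictive entropy exceeds the threshold."""
    return [record_id for record_id, probs in predictions.items()
            if shannon_entropy(probs) > threshold_bits]


# Hypothetical class probabilities for three extracted records.
predictions = {
    "record_1": [0.98, 0.01, 0.01],   # sharply peaked, low entropy
    "record_2": [0.40, 0.35, 0.25],   # spread out, high entropy
    "record_3": [0.50, 0.50, 0.00],   # exactly 1 bit
}

# Records above the (assumed) threshold would be routed to human review.
needs_review = flag_uncertain(predictions, threshold_bits=1.2)
```

A peaked distribution yields entropy near zero, whereas a uniform distribution over k labels yields log2(k) bits, so the threshold should be chosen relative to the number of candidate labels.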
The wider impact of these AI-powered tools is significant. By automating the extraction of material properties and building accurate, comprehensive databases, materials science is shifting toward a data-centric paradigm that minimizes manual curation. However, realizing the full potential of AI in this field requires addressing significant challenges, including scalability, dataset diversification, and the ability to process multimodal information sources. To overcome these hurdles, future efforts should focus on enhancing the generalizability and scalability of AI models. Developing multimodal extraction capabilities to accommodate diverse data formats and designing specialized LLMs tailored to subfields within materials science could help bridge existing gaps. Embracing continual learning and incorporating robust uncertainty quantification across a broader range of models will further enhance the practical utility of these tools for large-scale applications.

Data preprocessing

Data preprocessing is a crucial first step in building AI-powered infrastructure, especially in the context of materials discovery, where vast and diverse datasets need to be integrated, cleaned, and standardized. The quality of data directly influences the performance of AI models, making preprocessing essential for extracting meaningful insights. In materials science, data often comes in heterogeneous formats, creating unique challenges that require robust preprocessing techniques. While tools such as Excel136 are frequently used for initial, straightforward data cleaning due to their intuitive interface and widespread familiarity among researchers, they are limited in scalability and automation capabilities. These limitations make them unsuitable for the large and complex datasets typically encountered in materials science projects, where robust and scalable preprocessing is essential.
To address these needs, more advanced tools are widely utilized, including Python’s Pandas library137,138 and OpenRefine, or Dplyr for R139,140,141. Pandas is a versatile and powerful library designed for data manipulation, offering extensive functions for handling missing values, filtering datasets, and merging complex data structures. Its strength lies in its integration with Python’s broader ecosystem, making it particularly effective for preprocessing materials science data, which often needs extensive transformation. On the other hand, OpenRefine provides a more user-friendly, interactive approach for handling messy data, particularly useful for initial data cleaning tasks like identifying inconsistencies and standardizing entries. However, its lack of scalability limits its use for large datasets typical in materials science. Normalization and data transformation are critical for standardizing diverse datasets, ensuring that the input features are comparable and suitable for ML algorithms. NumPy is the go-to tool for numerical operations in Python, offering efficient methods for array manipulation and normalization. Its simplicity and speed are advantageous when dealing with the large, numeric datasets often encountered in materials science, such as high-throughput experimental data. In comparison, tools like Apache Spark142 and Talend143 offer more scalable solutions for large-scale data processing, enabling distributed computing that can handle vast amounts of data typical in AI-powered infrastructure projects. Spark’s integration with Python and its in-memory processing capabilities make it well-suited for big data applications, though it requires a more complex setup and greater expertise compared to simpler, standalone libraries like NumPy. A unique aspect of materials discovery is the need to preprocess chemical data, such as molecular structures and properties. This is where RDKit144 plays a significant role. 
RDKit is an open-source toolkit designed for cheminformatics, providing functions to process and analyze chemical structures. It excels in tasks like molecule standardization, fingerprint generation, and chemical feature extraction, making it invaluable for working with data on compounds and materials. When integrated into a Python-based preprocessing pipeline alongside Pandas and NumPy, RDKit allows seamless extraction and transformation of molecular descriptors, which are essential for building predictive models in materials science. Its open-source nature and extensive functionality make it a preferred tool for researchers looking to include chemical information in their AI models. For data transformation workflows, tools like KNIME145 and Talend offer a more visual approach, allowing researchers to build complex preprocessing pipelines without extensive coding. KNIME’s integration capabilities make it particularly useful for handling diverse data types, including numerical, text, and chemical data processed by RDKit. However, the reliance on graphical user interfaces can limit flexibility, especially when automation is needed for large-scale projects. Alternatively, R provides strong statistical and data manipulation functions, which can be particularly advantageous for materials researchers who are comfortable with its syntax. While commercial tools like Alteryx146 offer advanced features for data transformation and blending, their closed-source nature and licensing fees can be a barrier in academic and open-source projects. By integrating tools like RDKit for cheminformatics with scalable frameworks such as Apache Spark and user-friendly libraries like Pandas and NumPy, researchers can create robust preprocessing pipelines capable of handling diverse data formats. This integrated approach allows for the inclusion of complex molecular descriptors in AI models, improving predictive accuracy and accelerating materials discovery. 
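Two of the preprocessing steps discussed in this section, handling missing values and normalization, can be sketched as follows. This is a minimal illustration that deliberately uses only the Python standard library so that it is self-contained; in practice the same operations would typically be vectorized with Pandas or NumPy. The function names and the sample column are hypothetical, and median imputation is only one of several reasonable strategies.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]   # a constant column carries no signal
    return [(v - mean) / std for v in values]


# Hypothetical descriptor column with one missing measurement.
raw = [2.0, None, 4.0, 6.0]
clean = impute_median(raw)     # the median of 2, 4, 6 is 4
scaled = standardize(clean)    # comparable scale for ML algorithms
```

Imputing before standardizing matters here: the mean and standard deviation are computed on the completed column, so every feature reaches the model on a comparable scale.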
The ability to preprocess large, heterogeneous datasets effectively not only streamlines the research process but also supports sustainability goals by reducing inefficiencies in data preparation, aligning with the broader vision of AI-powered materials discovery.

Data storage in the cloud and edge computing

Cloud and edge computing offer unique advantages for AI-powered, open-source infrastructure in materials discovery. Cloud computing stands out for its scalability and processing power, making it well-suited for handling large datasets and running complex AI algorithms147. Its centralized architecture provides extensive computational resources but often faces issues with higher latency, which can be problematic for real-time analysis. On the other hand, edge computing processes data closer to its source, reducing latency and enabling faster, real-time analysis148. This makes it particularly useful for applications needing immediate feedback, such as in experimental setups. However, edge devices generally have less computational capacity and storage compared to cloud platforms, limiting their ability to manage extensive datasets or highly complex models. In the context of AI-driven, open-source infrastructure for materials discovery, data storage and computational resources are critical, especially when utilizing both cloud and edge computing. Major cloud service providers like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) have played a key role in enabling high-performance computing (HPC) workflows. Their infrastructure supports scaling in computational materials science, providing the flexibility needed to handle the large datasets generated during material screening and optimization processes. However, while cloud computing offers significant benefits in scalability and accessibility, it also brings challenges related to energy consumption, carbon footprint, and long-term sustainability.
Recent studies have demonstrated the scalability of cloud platforms in materials discovery. For example, one study utilized up to 1000 virtual machines (VMs) on Microsoft Azure to screen over 32 million material candidates, achieving computational throughput that would be difficult to replicate with traditional on-premises HPC resources149. Similarly, a decentralized, cloud-based approach enabled globally dispersed laboratories to collaborate asynchronously, highlighting the potential of distributed research infrastructure150. These examples showcase the capability of cloud providers to democratize access to advanced computational tools, lowering barriers for researchers and enabling a more inclusive approach to computational materials discovery. Despite these advantages, environmental concerns persist. Cloud providers are making efforts to reduce the environmental impact of large-scale computing by utilizing renewable energy sources and developing carbon-neutral data centers151,152,153. However, the carbon footprint of extensive cloud computing remains a significant concern, especially in fields like materials science, where computational tasks are both resource-intensive and lengthy. In this regard, edge computing offers a promising alternative or complement to traditional cloud storage. Companies such as Cisco Systems, Intel Corporation, and NVIDIA are leading the development of edge computing solutions tailored for scientific and industrial applications. Edge computing minimizes the need for frequent data transfers between local devices and central cloud servers, helping to alleviate network latency and bandwidth issues. By processing data closer to its source, edge computing can reduce the energy costs associated with data movement and potentially decrease the overall carbon footprint.
However, edge computing also has its limitations, particularly in terms of reduced computational power and storage capacity compared to centralized cloud platforms154. Balancing cloud and edge computing for AI-powered materials discovery therefore requires weighing these trade-offs. Cloud platforms excel in tasks requiring extensive computational resources and seamless tool integration. In contrast, edge computing is more suitable for real-time data processing in localized experiments, where reducing latency and avoiding data transfer delays are crucial. A hybrid approach that combines cloud and edge solutions could offer a practical way forward, optimizing task allocation based on computational needs and energy efficiency155. Furthermore, new industry partnerships and initiatives are emerging to tackle the carbon footprint associated with data storage and cloud computing. Microsoft’s collaboration with OpenAI includes commitments to building carbon-negative data centers, while Google aims to power its cloud infrastructure entirely with carbon-free energy by 2030153, signaling a broader industry shift towards sustainability. However, the actual impact of these initiatives on the environmental footprint of computationally intensive materials discovery workflows remains to be thoroughly evaluated. Lastly, data transfer bottlenecks remain a critical issue, given the large volumes of data typical in computational materials science. Companies like IBM and Oracle are integrating advanced storage solutions and content delivery networks (CDNs) to help address these challenges. However, widespread adoption and detailed analysis of these tools are still limited in the current literature, highlighting an area that requires further exploration and refinement.
Addressing the environmental impacts and data transfer challenges through sustainable practices and innovative technologies will be crucial for the long-term success and sustainability of these infrastructures. By leveraging the strengths of both cloud and edge solutions, the materials science community can enhance collaboration, improve efficiency, and drive significant breakthroughs in materials discovery.

Data organization and indexation

Organizing data in materials science repositories presents several key challenges. The diversity of data types from crystal structures to electronic properties makes developing a unified schema challenging. Scalability is also a major issue, as current systems struggle to handle the surge in data without sacrificing performance. Furthermore, a lack of interoperability due to the absence of universal standards complicates the integration of datasets from various sources, hindering the potential for discovery. Many repositories also fail to adequately record data provenance, which hampers the verification and reproducibility of results. To address these issues, new approaches in data organization are being explored. For example, the Materials Project has implemented flexible data models that can easily incorporate new data types without extensive restructuring23. Initiatives like the European Materials Modelling Ontology (EMMO)156 aim to establish a common framework for describing materials and processes, though broad adoption remains limited. AiiDA157 leverages a graph-based data structure to capture complex relationships between computations and data, enhancing reproducibility; however, this method can be resource-intensive for large datasets. In this context, the FAIR principles44 have become a widely recognized standard for organizing scientific data.
These guidelines advocate for creating data repositories that are easily searchable, accessible via standardized protocols, interoperable with other systems, and reusable for future research. Although the adoption of FAIR principles is increasing in materials science158,159, challenges remain, particularly with achieving interoperability between platforms and standardizing metadata. Effective indexing is vital for the performance and usability of materials repositories, but poses its own challenges. Maintaining multidimensional indexes to support efficient searches across various properties can be computationally demanding. There is often a trade-off between speed and accuracy, as approximate indexing methods offer faster searches but may compromise precision. Additionally, keeping indexes current in dynamic repositories presents ongoing technical challenges. Several innovative indexing strategies are under development. Hierarchical indexing, used in systems like AFLOW105, supports efficient searches across multiple levels of detail but can be complex to implement in smaller systems. Semantic indexing, adopted by repositories like OMDB160, uses NLP to categorize materials based on textual descriptions, though these techniques still require further refinement. Adaptive indexing, which modifies its structure based on query patterns, shows potential but has not yet been scaled in materials databases. The impact of efficient data organization and indexing extends directly to sustainability efforts. Better data management reduces the computational load of searches and analyses, thereby lowering energy consumption. Identifying similar materials more efficiently can eliminate redundant calculations, saving valuable computational resources. Additionally, faster and more precise searches can accelerate the discovery of environmentally friendly materials, indirectly contributing to a lower carbon footprint across industries. 
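As a point of reference for the indexing trade-offs described above, the simplest exact index over a single numeric property is a sorted array searched by bisection. The sketch below is a generic illustration and does not represent how any of the named repositories implement indexing; the class name `PropertyIndex`, the identifiers, and the values are hypothetical.

```python
import bisect


class PropertyIndex:
    """Sorted single-property index supporting range queries via binary search."""

    def __init__(self, entries):
        # entries: iterable of (material_id, property_value) pairs
        self._pairs = sorted((value, mid) for mid, value in entries)
        self._values = [value for value, _ in self._pairs]

    def query(self, low, high):
        """Return ids whose value lies in the closed interval [low, high]."""
        start = bisect.bisect_left(self._values, low)
        stop = bisect.bisect_right(self._values, high)
        return [mid for _, mid in self._pairs[start:stop]]


# Hypothetical identifiers and band gaps (eV), for illustration only.
index = PropertyIndex([("m1", 0.5), ("m2", 1.1), ("m3", 1.8), ("m4", 3.2)])
hits = index.query(1.0, 2.0)
```

Each lookup costs O(log n) plus the size of the result, but the structure covers only one property and must be re-sorted or incrementally maintained as entries are added, which is the upkeep cost that multidimensional and dynamic repositories have to manage at scale.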
Despite these advancements, significant work remains in optimizing data organization and indexing for materials discovery. The lack of universal standards and the computational challenges of advanced solutions are persistent barriers. The research community must address these issues with a focus on reducing the environmental impact of the discovery process. The future of AI-powered open-source infrastructures will depend on developing systems that prioritize both performance and energy efficiency. Only then can we create a comprehensive framework that effectively tackles sustainability and carbon footprint challenges in materials discovery.

Data processing

Effective data processing in materials discovery necessitates the integration of diverse methodologies. The predictive modeling stage involves converting raw experimental or computational data into meaningful features and building models to estimate material properties. Essential tools in this process include Pymatgen161 and Matminer162. Pymatgen is a comprehensive Python library for materials analysis, offering robust tools for parsing structural data, performing symmetry analysis, and calculating physical properties such as band gaps and formation energies. Its integration with databases like the Materials Project facilitates the rapid acquisition of high-quality materials data, which can then be preprocessed into numerical descriptors. Building on Pymatgen’s capabilities, Matminer provides a rich set of feature extraction functions tailored for materials science, automating the generation of descriptors based on composition, structure, and electronic properties for use in statistical analysis and ML models. For statistical processing, Scikit-learn163 is preferred due to its extensive array of algorithms for regression, classification, and clustering. This stage typically begins with exploratory data analysis to identify patterns and correlations between material descriptors and target properties.
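The idea of composition-based descriptors can be illustrated with a hand-rolled sketch. It is not the Pymatgen or Matminer API: the lookup table `ELEMENTS` covers only a few elements, the parser handles only flat formulas without parentheses, and the function names are hypothetical. Dedicated libraries generate far richer descriptor sets from complete element data.

```python
import re

# Small lookup of (atomic number, Pauling electronegativity); a real
# pipeline would draw on a complete element table.
ELEMENTS = {
    "Na": (11, 0.93), "Cl": (17, 3.16), "O": (8, 3.44),
    "Si": (14, 1.90), "Ti": (22, 1.54), "Fe": (26, 1.83),
}


def parse_formula(formula):
    """Map a flat formula such as 'Fe2O3' to {element: atomic fraction}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + float(amount or 1)
    total = sum(counts.values())
    return {el: n / total for el, n in counts.items()}


def featurize(formula):
    """Composition-weighted mean atomic number and electronegativity."""
    fractions = parse_formula(formula)
    mean_z = sum(frac * ELEMENTS[el][0] for el, frac in fractions.items())
    mean_chi = sum(frac * ELEMENTS[el][1] for el, frac in fractions.items())
    return {"mean_atomic_number": mean_z, "mean_electronegativity": mean_chi}


features = featurize("Fe2O3")   # atomic fractions: Fe 0.4, O 0.6
```

The resulting fixed-length numeric vector is the kind of input that regression or clustering algorithms expect, regardless of how many elements the original formula contains.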
Techniques such as principal component analysis (PCA) and random forest regression in Scikit-learn help determine feature importance and reduce dimensionality, enhancing both interpretability and model performance. For example, features extracted by Matminer can be used with a Scikit-learn random forest regressor to predict mechanical properties like hardness or elasticity, serving as a baseline model for further refinement through deep learning techniques. Handling large datasets is a common requirement in materials discovery, given the vast number of potential chemical compositions and structural configurations. The Atomic Simulation Environment (ASE)164 provides essential tools for managing big data from high-throughput simulations. ASE facilitates the automation of atomistic simulations, supports various computational chemistry codes (including Quantum ESPRESSO), and is designed for scalability in high-performance computing environments165. Real-time data analysis is increasingly important in materials discovery, enabling adaptive learning systems and dynamic optimization of experimental conditions. Deep learning frameworks such as TensorFlow166 and PyTorch167 are pivotal for implementing models that process real-time data streams. TensorFlow supports the development of neural networks capable of handling large, complex datasets and deploying models in real-time environments, which is useful for integrating ML models into experimental setups. PyTorch’s dynamic computation graph offers flexibility for building experimental neural architectures and supports reinforcement learning for real-time optimization tasks. In cheminformatics, RDKit complements the real-time processing capabilities of TensorFlow and PyTorch by providing tools for handling chemical information and molecular structure analysis. 
RDKit enables efficient preprocessing of chemical data, including molecular fingerprinting and the calculation of descriptors related to molecular shape, electronic properties, and reactivity, as previously discussed. These features can be combined with deep learning frameworks to develop predictive models that process molecular data in real time. Building on the foundation of robust data processing and real-time analysis, AI modeling has emerged as a transformative approach in materials discovery. By leveraging advanced ML and deep learning techniques, AI enables the development of predictive models capable of uncovering complex patterns and relationships in high-dimensional datasets. These models not only enhance the efficiency of property predictions and material screening but also facilitate the exploration of novel materials through generative frameworks and reinforcement learning. The integration of AI into materials research paves the way for accelerated innovation, bridging the gap between experimental data, computational simulations, and practical applications.

AI modeling

Traditional ML models in materials science

ML has become a transformative tool in materials science and engineering (MSE), enabling innovative solutions in material discovery, process optimization, and sustainability. Figure 5 provides a comprehensive overview of a typical ML workflow in MSE, illustrating the diverse data inputs, ML methodologies, and their practical applications. Traditional ML algorithms, such as support vector machines (SVMs), artificial neural networks (ANNs), and Decision Trees, have demonstrated their effectiveness in optimizing material synthesis processes, including nanofiltration and biomaterial development12,168. Predictive modeling techniques, such as Bayesian optimization and random forests (RF), have minimized experimental requirements while advancing research on cellulose composites, ionic liquids, and stable metallic glasses.
These advancements have enabled breakthroughs in smart packaging and biocompatible hydrogels17,169,170.

Deep learning models for material property prediction

Deep learning models extend the capabilities of traditional ML by extracting hierarchical features from large, complex datasets. Advanced neural networks, such as polymer-unit graph neural networks (PU-MPNNs), have revolutionized material property predictions for organic semiconductors, reducing training times by 98% while maintaining over 80% accuracy171. Similarly, Universal Interatomic Potentials (UIP) models, including MACE and CHGNet, have achieved remarkable accuracy with RMSEs as low as ~0.3–0.7 J/m², driving innovation in catalysis and nanotechnology applications172. In materials characterization, convolutional architectures like CNNs and DenseNets have automated microstructure classification with accuracies exceeding 94%. Tools like DeepXRD have further enhanced structural analysis by predicting X-ray diffraction (XRD) spectra with unprecedented precision173,174,175. Graph-based and multimodal learning frameworks have also advanced the field by integrating diverse datasets to enhance predictive accuracy. Models like SA-GNN and GeoCGNN have refined property predictions, achieving MAE values as low as 0.044 eV/atom for formation energy. Meanwhile, the MultiMat framework has reduced MAE by 10% for elastic tensors through multimodal data integration176,177,178. For imaging tasks, convolutional approaches like multi-scale refocusing networks (MRNs) and Mask R-CNN have optimized material image analysis, automating processes such as defect detection, nanoparticle recognition, and structural classification with accuracies exceeding 90%18,175,179,180,181. CNN-based methods have also facilitated damage detection in sandwich composites with accuracies over 94%, while AI models like Cellpose have promoted innovations in biofouling-resistant materials, reducing bacterial adhesion by 98%18,182.
These advancements underscore the pivotal role of graph and image-based learning in sustainable materials science. Deep neural networks (DNNs) have outperformed traditional methods like Random Forest Regression in predicting lattice constants and band gaps for inorganic perovskites, achieving RMSE values below 0.1 Å and 0.7 eV, respectively. These improvements have significantly expedited research in optoelectronics14. Hybrid deep learning–ensemble learning (DL-EL) models, optimized with Genetic Algorithms, have achieved R² values of 0.9863 for Cu–Ni–Si alloys, enhancing predictions of hardness and tensile strength16. These examples highlight how ML-driven innovations are fostering efficiency across material discovery pipelines. One particular challenge in deep learning–based material property prediction is extrapolation—the ability to make accurate predictions for materials beyond the model’s training distribution. Traditional models tend to be interpolative, limiting their effectiveness to chemical spaces near the data they have seen. Recent analyses suggest that many Out-of-Distribution (OOD) evaluations in materials science may actually reflect interpolation rather than true extrapolation, leading to overestimated generalization capabilities183. For instance, many ML models perform well on seemingly OOD tasks simply because the test data remain within the representational domain of the training set rather than exploring truly novel chemical spaces. However, recent advances in meta-learning and attention-based neural architectures have shown strong potential to tackle this issue. The extrapolative episodic training (E2T) framework184, for example, employs matching neural networks (MNNs) to develop extrapolative capability through training on artificially generated extrapolative tasks. 
This approach has proven effective in polymeric and hybrid organic–inorganic perovskite systems, demonstrating superior generalization to materials outside the initial training set. Complementing such frameworks with more rigorous OOD benchmarks184 would further enhance model robustness by ensuring evaluations account for materials that truly differ from the training data in both chemical composition and structure. While predictive models enhance our understanding of known materials, generative AI opens a new frontier, allowing algorithms to create entirely novel compounds tailored to performance, sustainability, and application-specific needs.

Federated learning for collaborative materials informatics

Federated learning (FL) is a decentralized training paradigm in which model updates are shared while raw data remain local, enabling collaborative model development under confidentiality, intellectual-property, or regulatory constraints185. This is particularly relevant to materials science because experimental and computational datasets are routinely fragmented across institutions (“data islands”), and transferring high-volume data can be impractical or undesirable. In materials informatics, parameter-exchange training across multi-source databases has been shown to produce structure–property models (e.g., formation energy prediction) with accuracy close to that obtained from fully centralized datasets, while reducing bandwidth demands and preserving local control over data assets186. Related work in the physical sciences demonstrates that horizontal FL can mitigate data scarcity and selection bias across heterogeneous contributors, and that federated dimensionality reduction can improve convergence and reduce inter-client statistical heterogeneity in materials-relevant datasets such as metallic nanoparticles187.
Beyond methodological feasibility, secure and scalable implementations have also been proposed for large collaborative materials ecosystems: MatSwarm integrates FL with blockchain-based coordination and trusted execution environments to address non-i.i.d. materials data, governance, and adversarial risks while supporting large-scale multi-institution participation188. In smart manufacturing, FL has been actively explored as a privacy-preserving route to improved process monitoring and quality control across distributed production sites, where pooling raw process data across factories is often infeasible. For metal additive manufacturing, an FL-based case study for melt-pool anomaly (defect) detection shows performance comparable to centralized learning while maintaining confidentiality, and highlights robustness across varying melt-pool conditions that are expected in real deployments189. More broadly, FL has been demonstrated for part qualification and dimensional prediction across additive-manufacturing factories, where the approach can achieve centralized-learning accuracy under statistically similar production regimes while substantially outperforming isolated local training when each site has limited data190. However, these studies also emphasize that FL performance can degrade when participating sites exhibit strong distribution shifts (e.g., different defect regimes or quality levels), indicating that federation design and client heterogeneity management are central practical challenges rather than secondary implementation details. Collectively, these findings position FL as a principled mechanism for privacy-preserving collaboration in materials research and manufacturing, while underscoring that practical success depends on explicitly addressing heterogeneity, imbalance, and governance rather than assuming federated training alone resolves these constraints188,189,190. 
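The parameter-exchange idea behind FL can be sketched with federated averaging on a toy problem. This is a minimal single-process simulation under stated assumptions, not the protocol of any system cited above: the model is a one-variable linear regression, the two "sites" and their synthetic data are hypothetical, and the learning rate, epoch count, and round count are illustrative choices. No security, blockchain coordination, or heterogeneity handling is included.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Gradient descent on the model y = w*x + b using one client's data only."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(client_weights, client_sizes):
    """Average client parameters, weighted by local sample counts."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)


def train_federated(clients, rounds=200):
    """Run averaging rounds; only parameters pass between clients and server."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(global_weights, data) for data in clients]
        sizes = [len(data) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights


# Synthetic data from y = 2x + 1, split across two hypothetical sites
# that each observe a different range of x.
site_a = [(x / 10, 2 * x / 10 + 1) for x in range(0, 5)]
site_b = [(x / 10, 2 * x / 10 + 1) for x in range(5, 10)]
w, b = train_federated([site_a, site_b])
```

In this noise-free setting both sites share the same underlying relationship, so the averaged model recovers it even though neither site sees the full range of x. When sites follow different distributions, plain averaging can degrade, which is the heterogeneity problem highlighted in the studies above.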
Explainable artificial intelligence in materials science

Explainable artificial intelligence (XAI) has become a practical methodological layer in materials science because many high-performing models (e.g., gradient-boosted trees, DNNs, and graph-based architectures) are otherwise difficult to interrogate, limiting scientific trust, debugging, and downstream decision-making. In contemporary materials workflows, explainability is increasingly used for three tightly coupled purposes: (i) validating whether learned relationships are consistent with domain expectations; (ii) identifying influential variables that can be acted upon in process optimization or materials design; and (iii) translating model behavior into representations that enable human-guided hypothesis generation and constrained exploration of chemical and processing spaces191,192. This positioning is particularly relevant for open-source, modular infrastructures intended to support accelerated discovery and advanced manufacturing, where models must be auditable, reusable across datasets, and interpretable by non-ML stakeholders.

Across materials subdomains, two complementary explainability paradigms are prominent. The first is post-hoc, model-agnostic explanation, in which feature attribution tools are applied after training to quantify how inputs drive predictions locally (per sample) and globally (across a dataset). In porous materials and adsorption problems, SHAP- and LIME-based analyses have been used to expose dominant operational and structural factors underlying predicted uptake trends, thereby connecting predictive performance to interpretable drivers rather than opaque correlations193,194.
Similar SHAP-guided interpretation has been integrated into alloy design pipelines to clarify feature influence and to rationally delimit a Bayesian optimization search space for multi-property targeting, linking explainability directly to actionable design decisions rather than retrospective visualization195. In catalysis, SHAP has likewise been used to extract physically interpretable determinants of reactivity within a theory-infused neural framework, emphasizing explanation as a route to mechanistic insight into electronic contributions rather than a purely statistical ranking196. In manufacturing-relevant contexts, explainability is also being extended to high-dimensional sensing modalities: deep-learning models applied to in situ process video streams have been coupled with multiple XAI techniques to distill actionable recommendations from process dynamics, while explicitly acknowledging risks such as overinterpretation and confounding in human-in-the-loop inference197. For signal-based classification tasks, activation mapping approaches such as Score-CAM have enabled identification of the most informative sensor regions and time segments, providing a concrete bridge between learned decision boundaries and physically meaningful sensing responses198. The second paradigm focuses on intrinsically interpretable or theory-/physics-infused modeling, where interpretability is embedded in the representation or architecture, reducing reliance on purely post-hoc explanations. Composition-based attention models provide a clear example: interpretability mechanisms incorporated into CrabNet enable inspection of learned element representations and attention patterns, facilitating chemical-space visualization, diagnosis of dataset imbalance effects, and identification of potential data/model issues through attention dynamics199. 
Related efforts explicitly target deep network transparency in composition-only predictors, where XAI toolkits are applied to analyze whether inferred element importances align with expected chemical attributes and whether predicted stability landscapes are consistent across binary composition spaces, thereby clarifying both strengths and failure modes of black-box predictors200. In alloy thermodynamics and phase-boundary prediction, graph-based models have been paired with explainability layers to identify influential elemental interactions and to derive empirical equations that improve usability and interpretability for design-oriented workflows201. In parallel, theory-infused deep learning has been used to decompose predicted energetic contributions into physically meaningful components, emphasizing interpretability as a structural property of the model rather than an external add-on202. Hybrid frameworks that augment feature-based models with learned graph-derived descriptors and then “decode” latent representations into explicit formulas via surrogate modeling and symbolic regression further illustrate a route to interpretability that yields analyzable, human-readable descriptors without fully sacrificing predictive performance203. Conceptually, these approaches align with broader calls in materials AI to combine data-driven models with physical knowledge and to use explainability to improve transparency and adoption192. A recurring constraint across materials domains is data sparsity and heterogeneity, which shapes how explainability is deployed and what it can credibly support. Small-data perspectives explicitly frame interpretability as essential for ensuring that limited datasets yield models that remain scientifically informative and not merely predictive, while also motivating constrained learning strategies (e.g., active learning, transfer learning) and physically grounded descriptors191. 
Physics-informed augmentation and extrapolative modeling strategies provide a concrete example of how interpretability and physical constraints can be coupled under data scarcity: process-variable models for enzymatic fiber refining achieved strong predictive performance while using physics-informed data augmentation and physics-informed regression to support extrapolation and targeted experimental verification, emphasizing a workflow where model-driven guidance remains connected to measurable process variables and mechanistic interpretation204. In parallel, literature-scale synthesis efforts that combine meta-analysis with explainable ML have used XAI to identify the most consequential formulation and processing features in lignin-nanoparticle Pickering emulsions, illustrating how explainability can convert heterogeneous reports into design heuristics while highlighting the importance of curated, structured datasets for robust modeling205. Collectively, these studies reinforce that XAI outputs should be treated as explanations of model behavior under available data, useful for validation, prioritization, and constraint setting rather than direct evidence of causality, particularly when datasets are noisy, biased, or incomplete197,206. Within an AI-powered, open-source infrastructure for accelerating materials discovery and advanced manufacturing, these patterns motivate an implementation stance where explainability is not an optional visualization step but a standard, versioned artifact of the modeling pipeline. Practically, this means supporting both model-agnostic attribution (e.g., SHAP/LIME for tabular and structured descriptors) and modality-appropriate explainability for deep models (e.g., attention inspection for composition/graph architectures, activation mapping for signals and imagery), while enabling theory-/physics-infused options when domain constraints are known and must be enforced194,198,199,201. 
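As an illustration of the model-agnostic attribution mentioned above, the sketch below computes exact Shapley values, the quantity that SHAP-style tools approximate, for a single prediction. It does not use the SHAP library. The surrogate `toy_uptake`, its three inputs, and the all-zero baseline are hypothetical choices made only for this example, and absent features are represented by baseline values, which is one common convention among several.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_uptake(z):
    """Hypothetical surrogate: uptake from (surface area, pore volume, temperature)."""
    area, pore, temp = z
    return 0.5 * area + 2.0 * area * pore - 0.1 * temp

x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_uptake, x, baseline)
```

The attributions sum to the difference between the prediction and the baseline prediction, and the area-pore interaction term is split evenly between the two features involved. As the text stresses, such values explain the model's behavior, not the causal structure of the underlying material.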
Under small-data regimes and heterogeneous experimental records, XAI should be coupled to dataset diagnostics, uncertainty-aware evaluation, and human review to reduce overinterpretation and to maintain traceability from data provenance to predictions and derived design recommendations191,197,204,206. This integration directly supports reproducible decision-making in discovery and manufacturing settings by making the rationale for screening, optimization, and experimental prioritization explicit and inspectable across users, datasets, and model revisions.

Generative AI in materials science

Generative models have become a central component of AI-enabled inverse design, enabling the creation of candidate compositions and structures by learning high-dimensional distributions rather than only predicting properties of known materials. This paradigm complements high-throughput screening by proposing new data points that can expand accessible phase spaces and reduce experimentalist bias in exploration workflows207,208,209. Across materials classes, the practical impact of generative methods is strongly conditioned by (i) the representation used to encode structure and chemistry, and (ii) the degree to which generated candidates can be validated for stability and synthetic feasibility209,210. Consequently, recent work increasingly frames generative modeling as an integrated design loop (generation, filtering by physics- or ML-based surrogates, and experimental or high-fidelity computational confirmation) rather than as a standalone sampling procedure20,207,208.

Classical generative models

Classical generative models, particularly VAEs and GANs, have been widely adopted as foundational tools for inverse materials discovery because they enable controlled sampling of latent spaces to propose new compositions, molecules, and microstructures209,210.
In materials informatics, these approaches have been applied across multiple design scenarios, including composition design, crystal structure search, microstructure characterization, and defect-related image generation, underscoring their versatility beyond molecule-only settings209,211. Their effectiveness, however, remains tightly coupled to how materials are encoded: low-complexity representations (e.g., phase-space and composition-only encodings) improve interpretability and breadth, whereas coordinate- or image-based encodings better support downstream coupling to physics-based validation but increase modeling complexity and overfitting risk209. Despite their demonstrated utility, GAN-based pipelines face well-documented technical constraints that are particularly consequential in materials settings, where data are often limited, and distributions can be multi-modal. Reviews emphasize that robust GAN training often requires large datasets, while common augmentation strategies may increase sample count without adding fundamentally new information, motivating the incorporation of uncertainty-aware or data-efficient strategies when operating in small-data regimes211,212,213. In addition, generated samples may remain overly similar to the training set, limiting novelty and motivating hybridization strategies (e.g., combining GANs with continuous-latent models or embedding domain priors through tailored objectives) to better balance fidelity with diversity211. These limitations are also observed in crystal generation tasks, where vanilla GANs can exhibit training instability and mode collapse, and even stabilized variants (e.g., Wasserstein-based objectives) may still struggle to consistently reproduce physically realistic symmetry in generated structures, indicating that model choice and representation jointly govern scientific usefulness214. 
More broadly, survey-level analyses stress that synthetic viability and interpretability remain persistent barriers for mainstream adoption of generative models in materials research, reinforcing the need for validation-aware workflows and transparent error analysis when deploying VAEs and GANs for discovery rather than interpolation210.

Diffusion models for materials generation and design

Diffusion models have emerged as a strong alternative to adversarial generative approaches for materials discovery because they generate candidates through iterative denoising, which can improve training stability and yield higher-fidelity structures in representation-sensitive settings214. In comparative crystal-generation studies, diffusion models produced more realistic and symmetric structures than vanilla GANs and WGAN variants, indicating an empirical advantage in capturing crystallographic regularities214. Recent work further emphasizes scalability: diffusion models coupled to unified crystal representations have been trained on datasets containing millions of materials and evaluated using downstream-relevant metrics such as per-composition formation energy and convex-hull stability from DFT, explicitly addressing the known mismatch between generic generative metrics and discovery objectives215. This shift reinforces the broader need for chemically grounded evaluation in generative materials workflows, including hybrid strategies that combine lightweight distributional metrics with physics-based screening for high-confidence candidates216.

A second capability that makes diffusion models particularly relevant for inverse design is conditional generation, which allows structures to be generated under compositional, structural, processing, or property constraints while preserving sample diversity215,217.
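To make "iterative denoising" concrete, the following sketch shows the forward (noising) half of a standard denoising diffusion model, which is what the denoising network is trained to invert. It assumes the common linear variance schedule and treats a structure as a flat list of coordinates. The function names are hypothetical, and the crystal-specific models cited here handle lattices, periodic coordinates, and atom types in more specialized ways that this sketch does not attempt.

```python
import math
import random

def linear_beta_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t, increasing linearly over the schedule."""
    return [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s): the surviving signal fraction."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).

    Returns the noised sample and the noise eps that a denoiser would be
    trained to predict from (x_t, t).
    """
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * e for xi, e in zip(x0, eps)]
    return xt, eps

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [0.0, 0.25, 0.5]  # hypothetical coordinates of one atom
xt, eps = noise_sample(x0, 500, alpha_bars, random.Random(0))
```

At early steps the sample stays close to the original structure; by the end of the schedule almost no signal remains, so generation can start from pure noise and denoise step by step. Conditional generation leaves this noising process unchanged and supplies the constraint (for example, a target composition) as an additional input to the denoising network.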
For crystalline materials, point-cloud diffusion approaches demonstrate this principle by generating structures conditioned on elemental composition and validating candidates using DFT-based screening (including stability-related analyses such as hull energy and phonon checks), creating a direct bridge between generation and physics-based verification218. For complex non-crystalline regimes, diffusion modeling has also been extended to amorphous materials, where conditional generation enabled sampling across compositions and processing conditions while reproducing short- and medium-range order and multiple macroscopic properties at substantially reduced computational cost relative to conventional MD simulations, with occasional post-generation relaxation required for rare outlier environments217. From an infrastructure and autonomy perspective, diffusion models map naturally onto the design principles of SDLs and closed-loop materials optimization because they can generate diverse candidates that are subsequently filtered, ranked, and iteratively refined using downstream stability and property models, reducing reliance on exhaustive high-cost simulations and experimental trials219.

From LLMs to agentic AI in materials discovery

While LLMs have demonstrated significant value as knowledge assistants for materials science, supporting literature mining, hypothesis generation, representation learning, and property prediction, the key conceptual shift lies in transforming LLMs into agentic systems capable of planning, tool use, memory, and iterative reasoning across complete discovery workflows. Recent perspectives emphasize that the impact of AI in materials research increasingly depends not on isolated model performance, but on the integration of LLMs into autonomous or semi-autonomous pipelines that coordinate generation, evaluation, validation, and human feedback within closed-loop discovery systems220,221.
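The generate, filter, rank, and refine cycle described above can be sketched as a small closed loop. Everything here is a stand-in chosen for illustration: `propose` replaces a trained generative model with random sampling of a composition fraction x in a hypothetical A(1-x)B(x) system, while `hull_energy` and `target_property` are made-up surrogates rather than DFT or ML predictions.

```python
import random

def propose(center, width, n, rng):
    """Stand-in generator: sample composition fractions x in [0, 1] around a center."""
    return [min(1.0, max(0.0, rng.gauss(center, width))) for _ in range(n)]

def hull_energy(x):
    """Hypothetical stability surrogate (eV/atom above hull); lower is more stable."""
    return 0.2 * abs(x - 0.5)

def target_property(x):
    """Hypothetical property surrogate to maximize; peaks at x = 0.7."""
    return -(x - 0.7) ** 2

def closed_loop(rounds=5, n=50, max_hull=0.05, seed=0):
    rng = random.Random(seed)
    center, width = 0.5, 0.3
    best, best_scores = None, []
    for _ in range(rounds):
        candidates = propose(center, width, n, rng)                        # generate
        stable = [x for x in candidates if hull_energy(x) <= max_hull]     # filter
        ranked = sorted(stable, key=target_property, reverse=True)         # rank
        if ranked and (best is None or target_property(ranked[0]) > target_property(best)):
            best = ranked[0]
        if best is not None:
            center, width = best, width * 0.5                              # refine the search
            best_scores.append(target_property(best))
    return best, best_scores

best, best_scores = closed_loop()
```

The loop never accepts a candidate that fails the stability filter, and the best score can only improve from round to round. In the agentic systems discussed next, an LLM-based planner would take the place of this fixed schedule, deciding which generator, evaluator, or external tool to call at each step.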
In this context, agentic AI represents a transition from task-level automation toward collaborative, goal-driven scientific workflows.

Several recent frameworks illustrate this evolution by combining LLM reasoning with generative and predictive models. Reinforcement-learning-enhanced language models such as MOFGPT demonstrate how LLM-based generators can be coupled to property predictors and reward functions to enable inverse design under explicit validity, novelty, and performance constraints, outperforming fine-tuning-only approaches in navigating complex chemical spaces222. More broadly, agent-based architectures such as MAPPS, MatAgent, and MOFGen integrate LLMs with diffusion-based structure generators, physics-informed evaluators, synthesis feasibility checks, and external scientific tools, enabling iterative refinement of candidates and substantially improving stability, novelty, and targeting performance relative to standalone generative models221,223,224. These systems exemplify how agentic AI can orchestrate heterogeneous components (language reasoning, generative modeling, physics-based screening, and expert feedback) into coherent discovery pipelines that more closely resemble human scientific workflows.

Despite these advances, several limitations remain that constrain the deployment of agentic AI in real-world materials research. First, reliability and scientific grounding remain central challenges. LLMs trained predominantly on general text data may exhibit hallucinations, weak numerical reasoning, or inconsistent handling of structural and quantitative information, limiting trust in autonomous decision-making225,226. While retrieval-augmented generation, reinforcement learning, and physics-informed tool integration can mitigate these issues, robust self-verification and error-detection mechanisms are still under active development221,227.
Second, explainability and transparency remain critical barriers, particularly as agentic systems grow in complexity. Although attention-based transformers and language-centric representations offer partial interpretability, the reasoning chains of multi-agent systems often remain opaque, complicating validation, debugging, and regulatory acceptance227,228. A further challenge concerns standardization and benchmarking. The diversity of representations (e.g., SMILES, SLICES, MOFid, text descriptions), evaluation metrics, and datasets across studies makes systematic comparison difficult and limits reproducibility229,230,231. Recent benchmarking efforts highlight that general-purpose LLMs often underperform domain-specific or instruction-tuned models on materials tasks, underscoring the need for standardized benchmarks, task-specific fine-tuning protocols, and shared evaluation frameworks tailored to materials discovery objectives231. In parallel, domain-specific LLMs, trained or adapted using curated materials corpora, structured representations, and multimodal data, are increasingly recognized as essential for achieving reliable performance without excessive model scaling226,232,233.

Looking forward, future research is likely to focus on three converging directions. First, developing domain-adapted, multimodal foundation models that integrate text, structure, images, and experimental data will be critical for robust reasoning and generalization across materials classes225,234. Second, advancing agentic architectures with explicit planning, uncertainty awareness, and human-in-the-loop control will be essential to balance autonomy with scientific accountability and trust221,227,235. Third, establishing standardized, open infrastructures including benchmarks, data schemas, tool interfaces, and validation protocols will be necessary to enable reproducible, interoperable, and scalable deployment of agentic AI in both academic and industrial materials research228,230,231.
Together, these directions position agentic AI not as a replacement for human expertise, but as a collaborative layer that augments scientific reasoning, accelerates discovery cycles, and supports more efficient and sustainable materials innovation. To facilitate comparison across predictive, generative, and agentic AI approaches discussed in “AI modeling”, Table 5 provides a consolidated overview of representative model classes, their typical input representations, application domains, and task-appropriate performance metrics. Because these approaches address fundamentally different objectives, performance metrics are reported as evaluation criteria rather than as directly comparable numerical benchmarks.

AI in cloud-based infrastructure for materials science

The rise of AI has driven leading cloud providers to develop AI solutions tailored to advanced technologies for business and research. Cloud-based AI services offer benefits like scalability, cost-efficiency, and democratization of advanced tools, allowing organizations to access powerful AI resources without significant upfront infrastructure investment. Frameworks like TensorFlow and PyTorch thrive with the support of cloud-based compute resources, enabling continuous innovation and accessibility. In materials science, these services accelerate research by analyzing vast datasets, predicting material properties, optimizing manufacturing processes, and simulating experiments. For instance, Google’s DeepMind has leveraged AI to predict protein folding, a breakthrough with profound implications for drug design and materials engineering. While generative AI accelerates innovation, it requires substantial computational power, making cloud-based AI critical for scaling these advancements. The AI cloud services market, led by Amazon SageMaker236, Google AI Cloud Platform237, Microsoft Azure AI238, and IBM Watson, caters to diverse use cases with specialized features.
Amazon SageMaker facilitates tasks such as predicting material properties and optimizing experimental parameters. Google AI Cloud Platform excels with frameworks like TensorFlow and AutoML, ideal for recognizing patterns in materials datasets. Microsoft Azure AI enhances collaborative projects through tools like Azure ML and Power BI, while IBM Watson’s expertise in NLP aids in mining insights from scientific literature and patents. Despite their strengths, these platforms require extensive domain-specific customization and multidisciplinary collaboration to fully unlock their potential in materials science. Complementing commercial solutions, open-source deployment platforms democratize access, provide code transparency, and enable broader participation in AI-powered materials research. Deploying AI-powered platforms and tools in this domain offers unprecedented opportunities for innovation. AI models can rapidly sift through massive datasets, predict material properties, and simulate experimental outcomes, reducing the time and resources needed for discovery. However, realizing these benefits hinges on robust deployment strategies that ensure accessibility, scalability, and transparency. Open-source initiatives have a critical role in democratizing access to these tools, fostering collaboration, and driving collective progress toward sustainable solutions.

AI-infrastructure platforms and deployment tools

Version control platforms like GitHub239 have become a cornerstone for deploying AI models in materials science. Their intuitive interface and robust versioning capabilities allow developers to manage codebases, collaborate effectively, and integrate tools like CI/CD pipelines for streamlined deployment. Initiatives like The Materials Project23 provide comprehensive materials databases and tools like pymatgen for property simulations, while Intel Labs240 develops AI-driven frameworks like matsciml for ML applications.
DeepMaterials promotes accessible informatics with tools like SLMat241 for serverless materials design, and Materials Virtual Lab242 applies graph-based AI models with repositories like matgl. Dedicated web platforms offer a user-centric approach to deploying AI models by enabling real-time interaction through graphical interfaces or APIs, making advanced tools accessible to researchers without coding expertise. For instance, GitHub Pages and Docker enhance GitHub deployments with lightweight hosting and consistent runtime environments, while frameworks like Flask and Streamlit provide dynamic dashboards for intuitive interaction. A notable example is a Flask-based battery data analysis platform designed for lithium-ion batteries243, which integrates ML to estimate the state of charge (SOC) with errors under 2%, alongside interactive visualizations and real-time monitoring capabilities. Similarly, Streamlit-powered applications like LatticeML244 and Stmol245 enable property predictions and 3D molecular visualizations. Additionally, platforms like M2Hub246 and PcMSP247 demonstrate how web-based tools can combine ML with standardized workflows for property simulation. While GitHub excels in collaboration and transparency, web platforms offer broader accessibility but often require additional hosting resources and maintenance.

Best practices for deploying data models involve thoughtful organization and structure to ensure usability and reproducibility. Models should be packaged in portable, reproducible environments, for example Docker containers (orchestrated with Kubernetes where scaling is needed) or, for lighter-weight isolation of Python dependencies, Virtualenv, to reduce compatibility issues. Clear documentation, including usage instructions, dependencies, and version history, is essential. For GitHub repositories, modularized codebases and the inclusion of pre-trained model weights or scripts for fine-tuning greatly enhance usability.
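The pattern of exposing a model through a web API can be sketched without any third-party framework, using Python's standard WSGI interface, on which frameworks such as Flask are built. This is an illustrative sketch only: `predict_soc` is a hypothetical placeholder that maps cell voltage linearly to state of charge, not the ML model of the cited battery platform, and the `voltage` and `soc_percent` field names are assumptions.

```python
import io
import json

def predict_soc(voltage):
    """Hypothetical placeholder model: linear map from cell voltage (V) to SOC (%)."""
    lo, hi = 3.0, 4.2
    frac = (voltage - lo) / (hi - lo)
    return round(100.0 * min(1.0, max(0.0, frac)), 1)

def app(environ, start_response):
    """Minimal WSGI endpoint: accepts a JSON body with 'voltage', returns 'soc_percent'."""
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
        body = {"soc_percent": predict_soc(float(payload["voltage"]))}
        status = "200 OK"
    except (KeyError, TypeError, ValueError):
        body = {"error": "expected a JSON body with a numeric 'voltage' field"}
        status = "400 Bad Request"
    data = json.dumps(body).encode("utf-8")
    headers = [("Content-Type", "application/json"), ("Content-Length", str(len(data)))]
    start_response(status, headers)
    return [data]

def call_app(payload):
    """In-process request helper, useful for testing the endpoint without a server."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {"CONTENT_LENGTH": str(len(raw)), "wsgi.input": io.BytesIO(raw)}
    seen = {}
    chunks = app(environ, lambda status, headers: seen.update(status=status))
    return seen["status"], json.loads(b"".join(chunks))

# To serve locally, one could run:
#   from wsgiref.simple_server import make_server
#   make_server("127.0.0.1", 8000, app).serve_forever()
```

Keeping the model function separate from the request handling means the same code can be tested in process, packaged in a container, and later swapped for a trained model without changing the interface that users see.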
On web platforms, incorporating tools for visualization, such as Plotly or Dash248, can bridge the gap between raw data and actionable insights. These deployment strategies are particularly valuable in materials discovery, where interdisciplinary teams rely on intuitive and well-documented tools to foster collaboration. By enabling seamless access to AI tools and data, open-source infrastructures play a pivotal role in addressing sustainability challenges. These platforms empower researchers to share findings and avoid redundant efforts, thereby conserving resources and reducing carbon footprints associated with trial-and-error experimentation. Resources like OpenKIM249, which curate interatomic models and provide tools for integrating them into simulation workflows, exemplify the power of open-access databases in enhancing reproducibility and collaboration in computational materials science. Furthermore, democratizing AI tools accelerates the identification of materials and processes, amplifying the collective impact of global research efforts. Ultimately, the proliferation of open-source AI infrastructures holds immense potential to catalyze materials innovation.

Accessibility and data transparency

Open data, as mentioned before, plays a pivotal role in the development and transparency of AI systems, but its availability and use vary significantly across different models. Truly open-source systems often incorporate openly available datasets, ensuring transparency and reproducibility. For example, BigScience’s BLOOM250 stands apart with its full transparency: the model, training data, and documentation are openly accessible. Open data initiatives, such as The Pile251 or Common Crawl252, provide vast repositories of text and other materials that support the training of AI models while allowing scrutiny of their content and provenance.
However, even open data requires significant labor to clean, curate, and adapt for AI purposes, adding layers of complexity to the notion of openness. In contrast, semi-open and closed AI systems often obscure their data sources or rely heavily on proprietary datasets, creating significant issues related to transparency and ethics. Semi-open systems, such as Meta’s LLaMA253 and Google’s Gemini254, exemplify this challenge. For instance, Meta’s LLaMA does not disclose the specific datasets used for training, which makes it difficult for researchers to evaluate biases or assess whether the data complies with IP laws. Similarly, Google’s Gemini, while claiming to build on web data, provides limited details about its training data, leaving its methodology open to speculation. This lack of transparency restricts researchers’ ability to identify potential biases or understand the model’s limitations. In materials discovery, such opacity can hinder the development of AI-driven tools that rely on high-quality, traceable datasets to generate accurate predictions for material properties or to guide experimental design. Furthermore, semi-open approaches often rely on datasets that aggregate content without clear attribution, raising ethical concerns about the misuse of IP and the perpetuation of harmful stereotypes embedded in unvetted sources.

Closed systems, like OpenAI’s GPT-4, take opacity even further by withholding all information about their training datasets, citing competitive advantages and security concerns. While this might protect proprietary interests, it avoids scrutiny and sidesteps accountability. For example, GPT-4 has been criticized for likely relying on vast web-scraped datasets and potentially including sensitive or copyrighted material255,256, but OpenAI’s refusal to disclose this information prevents independent verification.
Another example is IBM’s Watson257, which operates with heavily curated, proprietary datasets that are inaccessible to the public, limiting external evaluation of its models’ robustness and fairness. In materials discovery, the use of closed systems introduces challenges, as proprietary data sources may not cover the diversity of materials required for comprehensive predictions or could exclude critical information due to IP restrictions. This not only slows progress but also creates barriers for collaboration, which is essential in advancing fields such as energy storage materials, catalysts, or nanotechnology. To address these pressing challenges, there is a need not only to adopt open data policies but also to implement robust frameworks that enforce ethical and equitable data usage, ensuring respect for IP and fostering greater accountability in AI development. By prioritizing transparency and accessibility, particularly in fields like materials discovery, AI can significantly accelerate innovation, enabling the identification of novel materials with unique properties while upholding ethical standards and fostering trust in AI-driven research.

Quantum computing

Quantum computing is emerging as a transformative tool for materials discovery, offering unprecedented capabilities to model complex quantum systems with higher accuracy and efficiency than classical methods. Google’s Willow quantum chip exemplifies this potential, achieving groundbreaking advancements in error correction and stability. This enables computations that would take classical systems thousands of years to be performed in mere seconds, which could accelerate the discovery and optimization of new materials with unprecedented speed and precision258. Beyond superconducting qubits, molecular qubits offer a chemically tunable alternative with long coherence times and atomic-scale precision.
These qubits, based on transition metals, lanthanides, and actinides, provide enhanced control over spin interactions and could be integrated into hybrid quantum-classical systems for materials discovery259. Their tunability allows for precise engineering of qubit interactions, making them promising candidates for quantum simulations in chemistry and materials science.

While traditional techniques such as DFT and coupled cluster methods have provided valuable insights into material properties, their computational cost increases rapidly for large, strongly correlated systems. Quantum algorithms take advantage of quantum parallelism and entanglement, overcoming many of the limitations of purely classical methods. Among these algorithms, the Variational Quantum Eigensolver (VQE)260 has proven effective in calculating the ground-state energies of molecules like GeO2, SiO2, and LiH, and has facilitated the creation of databases documenting the corresponding Hamiltonians261, enabling more accurate material simulations. Recent work underscores the need to further develop classical embedding strategies and sampling techniques to complement quantum methods, since combining high-level descriptions of strongly correlated regions with lower-level treatments of less critical regions can accelerate workflows and guide the search for materials with improved properties259.

Ultimately, quantum computing offers an innovative pathway for addressing the challenges associated with strongly correlated electronic structures, an area where even advanced classical techniques such as DFT and coupled cluster expansions struggle to provide accurate predictions. By leveraging superposition and entanglement, quantum computers promise polynomial or exponential speedups in simulating complex material behaviors259. To fully capitalize on these capabilities, robust noise modeling and quantum error mitigation strategies must be developed.
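The variational idea behind VQE, mentioned above, can be illustrated with a single-qubit toy problem simulated classically. This sketch is not a molecular calculation and does not use any quantum SDK: the Hamiltonian H = aZ + bX, the one-parameter ansatz Ry(theta)|0>, and the function names are assumptions chosen so that the result can be checked against the exact ground-state energy, -sqrt(a^2 + b^2).

```python
import math

def energy(theta, a=1.0, b=0.5):
    """Expectation value <psi|H|psi> for H = a*Z + b*X and |psi> = Ry(theta)|0>."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)  # amplitudes of |0> and |1>
    # H as a 2x2 matrix is [[a, b], [b, -a]]; apply it to (c, s).
    h_psi = (a * c + b * s, b * c - a * s)
    return c * h_psi[0] + s * h_psi[1]

def minimize_energy(a=1.0, b=0.5, lr=0.2, steps=200, theta=0.1):
    """Classical optimizer loop of a VQE-style calculation (gradient descent)."""
    for _ in range(steps):
        # Parameter-shift rule: the exact gradient from two energy evaluations,
        # which is how gradients can be obtained from a quantum device.
        grad = 0.5 * (energy(theta + math.pi / 2, a, b) - energy(theta - math.pi / 2, a, b))
        theta -= lr * grad
    return theta, energy(theta, a, b)

theta_opt, e_min = minimize_energy()
e_exact = -math.sqrt(1.0 ** 2 + 0.5 ** 2)  # lowest eigenvalue of a*Z + b*X
```

The optimized energy matches the exact ground-state energy because this ansatz can represent the true ground state; for molecules, the quality of the result depends on how expressive the ansatz is. Here every energy evaluation is exact, whereas on real hardware each one is estimated from a finite number of noisy measurements, which is where noise modeling and error mitigation become essential.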
Developing such strategies involves understanding how environmental effects such as interactions with phonons, spins, and electromagnetic fluctuations contribute to qubit decoherence. Researchers are actively developing noise-mitigation techniques, such as readout-error correction, zero-noise extrapolation, and randomized compiling, to reduce error rates and extend coherence times in superconducting and spin-based qubits. These approaches provide practical solutions without requiring the extensive overhead of fully fault-tolerant quantum computing259.

An extension of VQE, known as qubit-ADAPT-VQE, streamlines the VQE procedure by reducing both the number of iterations and circuit complexity, allowing for chemically accurate simulations in studies of adsorption in MOFs and CO2 capture systems262. Grover’s search algorithm263 has also been used to identify Ni-Ti shape-memory alloys with minimized thermal hysteresis, offering a more efficient path than classical optimization. Meanwhile, quantum phase estimation (QPE) has shown great potential in the study of corrosion-resistant materials by simulating high-temperature behaviors in niobium-based alloys and magnesium systems264. As quantum hardware matures, resource estimates suggest that QPE-based approaches will become more widely accessible.

An important challenge, however, lies in developing deep knowledge of decoherence mechanisms within these platforms, as each qubit architecture, whether superconducting, spin-defect-based, or trapped-ion, faces materials-based noise that can significantly reduce quantum coherence259. In superconducting circuits, parasitic two-level systems at the interfaces of thin oxide layers are known to limit coherence times259. For spin-defect qubits in semiconductors, atomic-scale disorder or unintended dopants can introduce unwanted charge fluctuations that shorten qubit lifetimes259.
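To make the zero-noise extrapolation technique mentioned above more concrete, the following minimal sketch applies Richardson extrapolation to expectation values obtained at several amplified noise levels. The exponential damping model in `noisy_expectation`, the scale factors, and all function names are hypothetical assumptions for illustration; on a real device the values would come from circuits executed with deliberately amplified noise.

```python
import math


def richardson_extrapolate(scale_factors, expectation_values):
    """Evaluate at zero noise the polynomial through the (scale, value) points."""
    if len(scale_factors) != len(expectation_values):
        raise ValueError("need one expectation value per scale factor")
    if len(set(scale_factors)) != len(scale_factors):
        raise ValueError("scale factors must be distinct")
    estimate = 0.0
    for i, (x_i, y_i) in enumerate(zip(scale_factors, expectation_values)):
        # Lagrange basis polynomial for node x_i, evaluated at x = 0.
        weight = 1.0
        for j, x_j in enumerate(scale_factors):
            if j != i:
                weight *= (0.0 - x_j) / (x_i - x_j)
        estimate += weight * y_i
    return estimate


def noisy_expectation(scale, ideal=-1.0, decay=0.05):
    """Hypothetical noise model: the ideal value is damped exponentially."""
    return ideal * math.exp(-decay * scale)


# Scale 1.0 is the unmodified circuit; larger scales mimic amplified noise.
scales = [1.0, 2.0, 3.0]
values = [noisy_expectation(s) for s in scales]
mitigated = richardson_extrapolate(scales, values)
print(f"raw: {values[0]:.5f}, mitigated: {mitigated:.5f}, ideal: -1.00000")
```

Because the extrapolation fits a polynomial to a noise response that is only approximately polynomial, the mitigated value is an estimate rather than an exact correction, and higher-order fits can amplify statistical noise in the measured values.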
Trapped-ion and molecular qubits are likewise affected by materials-based noise: surface noise and patch potentials on electrodes can introduce anomalous heating. Although targeted experiments and careful materials synthesis can alleviate some of these issues, accurate modeling of decoherence processes for large and heterogeneous qubit systems on classical hardware remains extraordinarily demanding, driving further research into hybrid quantum-classical approaches for error mitigation and noise modeling259.

A key step in leveraging quantum computing frameworks is encoding classical data into quantum-compatible formats. For example, Qiskit Nature261 offers amplitude, basis, and angle encoding, which are essential for both quantum ML and quantum chemistry workflows. PennyLane265, a hybrid quantum-classical

Sources