Multilayer dynamic networks arise in many domains, which makes it important to characterize the interactions among their constituent entities. With advances in data acquisition technologies, multimodal data can now be collected for a multilayer network, allowing its structural features to be described from multiple perspectives. Modeling such a network is challenging because of its multivariate spatiotemporal dynamics and the diverse entity characteristics captured by the multimodal variables. This paper develops a novel multimodal spatiotemporal modeling methodology tailored to the analysis of multilayer dynamic networks. The network comprises a set of nodes and multiple layers, described through multimodal variables, notably event frequencies and attributes. Assuming that all layers share a common community structure, we fuse node connectivity and attribute data within that community structure via Bernoulli and Poisson distributions. To capture node connectivity patterns, we propose a multilayer spatiotemporal Hawkes process with a shared community structure that models node interactions from event frequency data. We also develop a hierarchical Expectation-Maximization (EM) algorithm for parameter estimation, with a theoretical guarantee of local convergence. The method is evaluated through numerical experiments and a real case study of an urban metro network system to validate its effectiveness.