Image style transfer aims to render content, given as a text description or another image, in the style of a reference image. With the significant advancements in image generation brought by diffusion models, recent studies have attempted to either fine-tune embeddings to learn a single style or utilize the pre-trained CLIP image encoder to extract style representations. However, style-tuning requires substantial computational resources, and the pre-trained CLIP image encoder is trained for semantic understanding rather than style representation. To address these challenges, we introduce a style-aware encoder and a well-organized style dataset, StyleGallery, to learn a style representation that is both crucial and sufficient for generalized style transfer without test-time tuning. With a design dedicated to style learning, the style-aware encoder is trained to extract expressive style representations from multi-level patches using a decoupled training strategy, while StyleGallery enables generalization across styles. Moreover, we employ content extraction and a content-fusion encoder to enhance image-driven style transfer. We highlight that our approach, named StyleShot, is simple yet effective in mimicking various desired styles, e.g., 3D, flat, abstract, or even fine-grained styles, without test-time tuning. Rigorous experiments validate that StyleShot achieves superior performance across a wide range of styles compared to existing state-of-the-art text- and image-driven methods.
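
To make the multi-level-patch idea concrete, the sketch below shows a minimal PyTorch encoder that embeds a style reference at several patch scales and pools the results into a single style embedding. It is only an illustration under assumed settings, not the paper's actual architecture: all module names, dimensions, and patch sizes are hypothetical, and the decoupled training strategy and content-fusion encoder are omitted.

```python
import torch
import torch.nn as nn

class MultiLevelPatchStyleEncoder(nn.Module):
    """Toy style encoder: embeds patches at several scales and pools them
    into one style embedding (names, sizes, and dims are hypothetical)."""

    def __init__(self, patch_sizes=(8, 16, 32), embed_dim=256, in_channels=3):
        super().__init__()
        # One patch-embedding projection per scale; a strided conv acts as
        # non-overlapping patch extraction plus a linear projection.
        self.patch_embeds = nn.ModuleList(
            nn.Conv2d(in_channels, embed_dim, kernel_size=p, stride=p)
            for p in patch_sizes
        )
        self.norm = nn.LayerNorm(embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, C, H, W) style reference
        per_scale = []
        for embed in self.patch_embeds:
            feat = embed(image)                         # (B, D, H/p, W/p)
            per_scale.append(feat.flatten(2).mean(-1))  # mean-pool patch tokens
        # Average the per-scale summaries into a single style embedding.
        style = torch.stack(per_scale, dim=0).mean(0)   # (B, D)
        return self.proj(self.norm(style))


if __name__ == "__main__":
    encoder = MultiLevelPatchStyleEncoder()
    ref = torch.randn(1, 3, 256, 256)   # dummy style reference image
    print(encoder(ref).shape)           # torch.Size([1, 256])
```

In a full system, an embedding like this would condition the diffusion model (e.g., via cross-attention), with content supplied separately by the text prompt or a content encoder.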