把docx文件转换为md文件,图片、表格、标题、章节保持不变
Go to file
2025-09-04 16:56:18 +08:00
test123 Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
testfile2 Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
.gitignore Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
analyze_outline.py Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
create_test_doc.py Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
docx_to_md.py Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
README.md Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
requirements.txt Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00
testfile.md Initial commit: DOCX to Markdown converter with improved heading level handling 2025-09-04 16:56:18 +08:00

DOCX to Markdown Converter

This Python script converts DOCX files to Markdown format, preserving formatting such as headings, bold, italic, underline, strikethrough, and highlight. It also extracts images from the DOCX file and saves them in an images directory.

Features

  • Converts DOCX to Markdown format
  • Preserves text formatting (headings, bold, italic, underline, strikethrough, highlight)
  • Extracts images and saves them in an images directory
  • Processes tables and converts them to Markdown format
  • Command-line interface for specifying input and output files

Requirements

  • Python 3.x
  • python-docx library

Install the required dependencies with:

pip install python-docx

Usage

python docx_to_md.py <input.docx> [output_directory]

Examples

# Convert a DOCX file to Markdown (output to current directory)
python docx_to_md.py document.docx

# Convert a DOCX file to Markdown with a specific output directory
python docx_to_md.py document.docx /path/to/output/directory

# If not specified, the output directory defaults to the current directory
python docx_to_md.py document.docx

The output Markdown file will have the same name as the input DOCX file, but with a .md extension.

How It Works

  1. The script reads the DOCX file using the python-docx library
  2. It extracts all images from the document and saves them in an images subdirectory
  3. It processes paragraphs, preserving formatting:
    • Headings are converted to Markdown headings (#, ##, ###, etc.)
    • Bold text is wrapped in **
    • Italic text is wrapped in *
    • Underlined text is wrapped in *
    • Strikethrough text is wrapped in ~~
    • Highlighted text is wrapped in **
  4. Tables are converted to Markdown table format
  5. The output is written to the specified Markdown file

Output Structure

The script creates the following structure:

output.md          # The main Markdown file
images/            # Directory containing extracted images
  image_1.png
  image_2.png
  ...

License

This project is licensed under the MIT License.