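1. Scheduling algorithms

1.1. How algorithms are replaced

Scheduling algorithms live in the directory /opt/AntDen/scheduler/temple. Each algorithm is an executable script or binary.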

1.2. Scoring an algorithm with the simulator

Run /opt/AntDen/simulator/tools/analyse clotho (where clotho is the name of your own algorithm). The lower the final score, the better the algorithm.

1.3. Switching algorithms

When starting the scheduler, set the environment variable AntDenSchedulerTemple=pandora:clotho (where clotho is the name of your own algorithm).

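1.4. What a scheduling algorithm is in AntDen

AntDen handles data persistence for scheduling algorithms in a uniform way, so an algorithm does not need to worry about how data is stored. If the service restarts, AntDen feeds the current state back to the algorithm through standard input. A scheduling algorithm is a program that runs purely in memory: machine resource information, job information, and any changes to them all arrive on standard input, and the algorithm writes the tasks it wants to start to standard output.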
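
The stdin-driven model described here can be sketched as a small dispatch loop. This is only an illustration, assuming each input line is one JSON object with a `name` and a `data` array (the shape seen in the sample input in section 1.6.1). The `Scheduler` class, its methods, and `run` are hypothetical names, not part of AntDen.

```python
import json


class Scheduler:
    """Hypothetical in-memory scheduler state; AntDen does not define this class."""

    def __init__(self):
        self.machines = {}  # ip -> attribute dict
        self.now = 0        # last value received through "time"

    def set_machine(self, ip, attrs):
        self.machines[ip] = dict(attrs)

    def set_time(self, t):
        self.now = t

    def dispatch(self, name, data):
        # Map the call name from the input stream to a method.
        # Names this sketch does not handle are silently ignored.
        handlers = {"setMachine": self.set_machine, "time": self.set_time}
        handler = handlers.get(name)
        if handler is not None:
            handler(*data)


def run(stream, scheduler):
    """Read one JSON call per line and apply it to the scheduler."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        call = json.loads(line)
        scheduler.dispatch(call["name"], call["data"])
    return scheduler
```

In a real algorithm the stream would be standard input, for example `run(sys.stdin, Scheduler())` after `import sys`, and the handler table would cover every function listed in section 1.5.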

1.5. Writing an algorithm

A scheduling algorithm consists of the following functions.

1.5.1. setMachine( %m )

  ip1:
    hostname: 10-60-79-144
    group: foo
    envhard: arch=x86_64,os=Linux
    envsoft: SELinux=Disabled
    switchable: 1
    workable: 1
    role: slave,ingress,master
    mon: health=1,load=0.1
  ip2:
    hostname: 10-60-79-144
    envhard: arch=x86_64,os=Linux
    envsoft: SELinux=Disabled
    switchable: 1
    group: foo
    workable: 1
    role: slave,ingress,master
    mon: health=1,load=0.1

Tells the scheduling algorithm about the machines.

1.5.2. setMachineAttr( ip, k, v )

Changes one attribute of one machine. For example, if machine ip1 goes offline, this function is called as setMachineAttr( 'ip1', 'mon', 'health=0' ).
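
A minimal sketch of how an algorithm might store this update and read the comma-separated `key=value` strings used by `mon`, `envhard`, and `envsoft`. The helper names are hypothetical. It assumes the new value replaces the whole attribute string, and that a machine is usable only when `mon` contains `health=1` and `workable` is 1; neither rule is stated by AntDen.

```python
def parse_pairs(text):
    """Turn 'health=1,load=0.1' into {'health': '1', 'load': '0.1'}."""
    pairs = {}
    for item in text.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def set_machine_attr(machines, ip, key, value):
    """Overwrite one attribute of a known machine; unknown IPs are ignored."""
    if ip in machines:
        machines[ip][key] = value


def is_healthy(machine):
    """Assumed rule: usable only if mon has health=1 and workable is 1."""
    mon = parse_pairs(machine.get("mon", ""))
    return mon.get("health") == "1" and str(machine.get("workable", 0)) == "1"
```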

1.5.3. setJobAttr( jobid, k, v )

Works the same way as setMachineAttr, but changes an attribute of a job.

1.5.4. setResource( %r )

 ip1:
  - [ CPU, 0, 2048 ]
  - [ GPU, 0, 1 ]
  - [ GPU, 1, 1 ]
  - [ MEM, 0, 1839 ]
  - [ PORT, 65000, 1 ]
  - [ PORT, 65001, 1 ]

Describes the resources of a machine.
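
Each resource entry is a triple of name, id, and amount. One possible in-memory representation, shown here only as a sketch with hypothetical helper names, keys a table by `(name, id)`. Ids and amounts are normalized because the sample input in section 1.6.1 mixes numbers and strings for these fields.

```python
def index_resources(triples):
    """Convert [[name, id, amount], ...] into {(name, id): amount} with integer amounts."""
    table = {}
    for name, rid, amount in triples:
        table[(str(name), str(rid))] = int(amount)
    return table


def total_free(table, name):
    """Sum the amount available for one resource name across all of its ids."""
    return sum(amount for (res_name, _), amount in table.items() if res_name == name)
```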

1.5.5. loadTask( @task )

  -
    jobid: J01
    taskid: T01
    status: stoped
    hostip: 127.0.0.1
    executer: ~
    resources:
      - [ CPU, 0, 2048 ]
      - [ GPU, 0, 1 ]
  -
    jobid: J01
    taskid: T02
    status: stoped
    hostip: 127.0.0.1
    executer: ~
    resources:
      - [ CPU, 0, 2048 ]
      - [ GPU, 0, 1 ]

If the scheduler is restarted, AntDen uses this function to tell the algorithm which tasks were already scheduled and have not yet finished.
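
After a restart the algorithm has to rebuild its view of which resources are in use. The sketch below assumes that setResource reports total capacity and that the algorithm itself must subtract what running tasks hold; the source does not confirm either point. `restore_usage` and the layout of `free` are hypothetical.

```python
def restore_usage(free, tasks):
    """Subtract resources held by already-running tasks from the free table.

    free:  {ip: {(name, id): amount}}
    tasks: list of dicts with 'hostip' and 'resources', as in the loadTask example.
    """
    for task in tasks:
        host = free.get(task["hostip"])
        if host is None:
            continue  # task on a machine this algorithm has not been told about
        for name, rid, amount in task["resources"]:
            key = (str(name), str(rid))
            # Never let the free amount go below zero.
            host[key] = max(0, host.get(key, 0) - int(amount))
    return free
```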

1.5.6. submitJob( conf )

  conf:
  -
    executer:
      name: exec
      param:
        exec: echo success
    scheduler:
      count: 10
      envhard: arch=x86_64,os=Linux
      envsoft: app1=1.0
      ip: 127.0.0.1 #
      resources:
        - [ GPU, 0, 2 ]
  -
    executer:
      name: exec
      param:
        exec: echo success
    scheduler:
      count: 10
      envhard: arch=x86_64,os=Linux
      envsoft: app1=1.0
      resources:
        - [ GPU, 0, 2 ]
  group: foo
  nice: 5
  domain: abc.com
  jobid: J.20200206.114247.252746.499
  owner: root
  name: job.abc

Submits a job to the scheduling algorithm.

1.5.7. stop(jobid)

Asks the scheduling algorithm to stop the job.

1.5.8. stoped(@task)

  -
    taskid: T01
    jobid: J01
    status: success
    result: exit:0
    msg: mesg1
    usetime: 3
  -
    taskid: T02
    jobid: J01
    status: success
    result: exit:0
    msg: mesg1
    usetime: 3

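Tells the scheduling algorithm that these tasks have exited and provides some information about how they exited. The algorithm can use this information to judge whether a task ended abnormally, and may choose to create a new task in its place.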
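
One way an algorithm might act on this exit information is a bounded retry policy. The sketch below is an assumption, not AntDen behaviour: it treats any status other than `success` as a failure (the `fail` value in the usage is invented for illustration) and uses a hypothetical `max_attempts` limit.

```python
def tasks_to_retry(stopped, attempts, max_attempts=3):
    """Return the ids of tasks that failed and still have retry budget.

    stopped:  list of dicts with 'taskid' and 'status', as in the stoped example.
    attempts: dict taskid -> number of failures seen so far (updated in place).
    """
    retry = []
    for task in stopped:
        if task.get("status") == "success":
            continue
        tid = task["taskid"]
        attempts[tid] = attempts.get(tid, 0) + 1
        if attempts[tid] < max_attempts:
            retry.append(tid)
    return retry
```

Calling it repeatedly with the same failing task returns that task id until the failure count reaches `max_attempts`, after which the task is no longer offered for retry.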

1.5.9. apply()

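AntDen calls this function periodically (once per second by default) to check whether any tasks have been produced. This is the algorithm's output: when the function is called and the algorithm has found tasks that can run, it writes them to standard output.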
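
A minimal first-fit sketch of what an apply step could look like. It is not the algorithm AntDen ships. It assumes that a resource id of `.` in a request means "any id", that a request may be satisfied across several ids of the same name, and that nothing is printed when no task can be placed. `first_fit` and `apply_pending` are hypothetical names, and the emitted records carry only a subset of the fields seen in the sample output.

```python
import json
import sys


def first_fit(request, free):
    """Pick concrete resources on one host for a request, or return None.

    request: [[name, id, amount], ...]
    free:    {(name, id): amount} for a single host; updated only on success.
    """
    chosen = []
    remaining = dict(free)  # work on a copy so a failed attempt changes nothing
    for name, rid, amount in request:
        need = int(amount)
        for (res_name, res_id), have in sorted(remaining.items()):
            if need == 0:
                break
            if res_name != name or (str(rid) != "." and str(rid) != res_id):
                continue
            take = min(need, have)
            if take > 0:
                chosen.append([res_name, res_id, str(take)])
                remaining[(res_name, res_id)] = have - take
                need -= take
        if need > 0:
            return None
    free.clear()
    free.update(remaining)
    return chosen


def apply_pending(pending, hosts, out=sys.stdout):
    """Place every pending task that fits and print the placements as one JSON array."""
    placed = []
    for task in list(pending):
        for ip, free in hosts.items():
            resources = first_fit(task["resources"], free)
            if resources is not None:
                placed.append({"jobid": task["jobid"], "taskid": task["taskid"],
                               "hostip": ip, "resources": resources})
                pending.remove(task)
                break
    if placed:
        out.write(json.dumps(placed) + "\n")
        out.flush()
    return placed
```

First-fit is chosen here only because it is short; a real algorithm would also weigh `nice`, `group`, `envhard`, `envsoft`, and machine health before placing a task.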

1.5.10. time()

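Reports the current time. The simulator runs as a fast-forwarded process, so when time is called the scheduling algorithm must return a time, or the number of cycles that have elapsed. The simulator uses this value when judging how good an algorithm is.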

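1.6. What the algorithm's input and output look like

After running a simulator job, files such as 20200726_224304.9648.in and 20200726_224304.9648.out appear under /opt/AntDen/scheduler/temple/run. The .in file holds the standard input that was given to the scheduling algorithm, and the .out file holds what the algorithm printed to standard output. If both files are correctly formatted, the algorithm can be swapped into AntDen for real use (how well it performs depends on the specific implementation).

1.6.1. Input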

{"name":"time","data":[1]}
{"name":"apply","data":[]}
{"name":"setMachine","data":["10.0.1.1",{"hostname":"node.1.1","switchable":1,"role":"slave","workable":1,"group":"foo","envhard":"arch=x86_64,os=Linux","envsoft":"SELinux=Disabled","mon":"MEM=1159,health=1,load=0.26"}]}
{"data":["10.0.1.1",[["CPU",0,"1"],["MEM",0,"1"],["GPU","0","1"],["GPU","1","1"],["GPU","2","1"],["GPU","3","1"]]],"name":"setResource"}
{"name":"setMachine","data":["10.0.1.2",{"hostname":"node.1.2","switchable":1,"workable":1,"role":"slave","group":"foo","envhard":"arch=x86_64,os=Linux","envsoft":"SELinux=Disabled","mon":"MEM=1159,health=1,load=0.26"}]}
{"data":["10.0.1.2",[["CPU",0,"1"],["MEM",0,"1"],["GPU","0","1"],["GPU","1","1"],["GPU","2","1"],["GPU","3","1"]]],"name":"setResource"}
{"name":"setMachine","data":["10.0.1.3",{"mon":"MEM=1159,health=1,load=0.26","envsoft":"SELinux=Disabled","envhard":"arch=x86_64,os=Linux","group":"foo","workable":1,"role":"slave","switchable":1,"hostname":"node.1.3"}]}
{"name":"setResource","data":["10.0.1.3",[["CPU",0,"1"],["MEM",0,"1"],["GPU","0","1"],["GPU","1","1"],["GPU","2","1"],["GPU","3","1"]]]}
{"data":[{"jobid":"J.1","group":"foo","nice":"5","conf":[{"executer":{"name":"exec","param":{"runtime":"3600","exec":"sleep 300"}},"scheduler":{"count":"1","resources":[["CPU",".","1"]],"envhard":"arch=x86_64,os=Linux","envsoft":"app1=1.0"}}]}],"name":"submitJob"}
{"name":"addProduct","data":[{"res":[["CPU",".","1"],["MEM",".","1"],["GPU",".","4"]],"conf":{"group":"foo","name":"vm1","id":"0001","startingtime":"300","cost":"100","shutdowntime":"20","unit":"3600"}}]}
{"name":"time","data":[2]}
{"name":"apply","data":[]}
{"name":"setMachine","data":["10.0.2.1",{"envhard":"arch=x86_64,os=Linux","role":"master","workable":1,"group":"foo","hostname":"node.2.1","switchable":1,"mon":"MEM=1159,health=1,load=0.26","envsoft":"SELinux=Disabled"}]}
{"name":"setResource","data":["10.0.2.1",[["CPU",0,"1"],["MEM",0,"1"],["GPU","0","1"],["GPU","1","1"],["GPU","2","1"],["GPU","3","1"]]]}
{"name":"submitJob","data":[{"nice":"5","conf":[{"scheduler":{"count":"1","resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]],"envsoft":"app1=1.0","envhard":"arch=x86_64,os=Linux"},"executer":{"param":{"runtime":"3600","exec":"sleep 300"},"name":"exec"}}],"jobid":"J.2","group":"foo"}]}
{"name":"time","data":[3]}
{"name":"apply","data":[]}
{"name":"submitJob","data":[{"jobid":"J.3","group":"foo","nice":"5","conf":[{"executer":{"name":"exec","param":{"exec":"sleep 300","runtime":"3600"}},"scheduler":{"count":"1","envsoft":"app1=1.0","envhard":"arch=x86_64,os=Linux","resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]]}}]}]}
{"name":"time","data":[4]}
{"data":[],"name":"apply"}
{"data":[{"group":"foo","jobid":"J.4","conf":[{"executer":{"param":{"exec":"sleep 300","runtime":"3600"},"name":"exec"},"scheduler":{"envhard":"arch=x86_64,os=Linux","envsoft":"app1=1.0","resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]],"count":"1"}}],"nice":"5"}],"name":"submitJob"}
{"name":"time","data":[5]}
{"name":"apply","data":[]}
{"name":"submitJob","data":[{"group":"foo","jobid":"J.5","conf":[{"scheduler":{"count":"1","resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]],"envhard":"arch=x86_64,os=Linux","envsoft":"app1=1.0"},"executer":{"name":"exec","param":{"exec":"sleep 300","runtime":"3600"}}}],"nice":"5"}]}
{"name":"time","data":[6]}
{"name":"apply","data":[]}
{"data":[{"nice":"5","conf":[{"scheduler":{"resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]],"envhard":"arch=x86_64,os=Linux","envsoft":"app1=1.0","count":"1"},"executer":{"name":"exec","param":{"runtime":"3600","exec":"sleep 300"}}}],"jobid":"J.6","group":"foo"}],"name":"submitJob"}
{"data":[7],"name":"time"}
{"name":"apply","data":[]}
{"data":[{"nice":"5","conf":[{"executer":{"name":"exec","param":{"runtime":"3600","exec":"sleep 300"}},"scheduler":{"envsoft":"app1=1.0","envhard":"arch=x86_64,os=Linux","resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]],"count":"1"}}],"jobid":"J.7","group":"foo"}],"name":"submitJob"}
{"data":[8],"name":"time"}
{"data":[],"name":"apply"}
{"name":"submitJob","data":[{"nice":"5","conf":[{"scheduler":{"count":"1","envhard":"arch=x86_64,os=Linux","envsoft":"app1=1.0","resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]]},"executer":{"param":{"exec":"sleep 300","runtime":"3600"},"name":"exec"}}],"jobid":"J.8","group":"foo"}]}
{"name":"time","data":[9]}
{"name":"apply","data":[]}
{"name":"submitJob","data":[{"nice":"5","conf":[{"executer":{"name":"exec","param":{"runtime":"3600","exec":"sleep 300"}},"scheduler":{"resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]],"envhard":"arch=x86_64,os=Linux","envsoft":"app1=1.0","count":"1"}}],"jobid":"J.9","group":"foo"}]}
{"data":[10],"name":"time"}
{"name":"apply","data":[]}
{"name":"submitJob","data":[{"conf":[{"scheduler":{"resources":[["CPU",".","1"],["MEM",".","1"],["GPU",".","1"]],"envsoft":"app1=1.0","envhard":"arch=x86_64,os=Linux","count":"1"},"executer":{"param":{"runtime":"3600","exec":"sleep 300"},"name":"exec"}}],"nice":"5","group":"foo","jobid":"J.10"}]}
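
Every input line above is a JSON object with a `name` and a `data` list, with keys in no fixed order. A small checker along these lines can help confirm that a captured `.in` file has that shape. `check_input_lines` is a hypothetical helper written for illustration, not an AntDen tool.

```python
import json


def check_input_lines(lines):
    """Return (line_number, reason) for every line that is not a {"name", "data"} call."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # blank lines are skipped
        try:
            call = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append((number, "invalid JSON: " + err.msg))
            continue
        if (not isinstance(call, dict) or "name" not in call
                or not isinstance(call.get("data"), list)):
            problems.append((number, "expected an object with 'name' and a list 'data'"))
    return problems
```

It can be run over a file with `check_input_lines(open(path))`, where `path` points at one of the `.in` files under the run directory.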

1.6.2. Output

[{"jobid":"J.1","ingress":null,"taskid":"T.1.001","group":"foo","hostip":"10.0.1.3","executer":{"name":"exec","param":{"exec":"sleep 300","runtime":"3600"}},"resources":[["CPU","0","1"]]}]
{"name":"time","data":[3]}
[{"executer":{"param":{"exec":"sleep 300","runtime":"3600"},"name":"exec"},"resources":[["CPU","0","1"],["MEM","0","1"],["GPU","3","1"]],"jobid":"J.2","ingress":null,"group":"foo","taskid":"T.2.001","hostip":"10.0.1.1"}]
{"name":"time","data":[4]}
[{"ingress":null,"jobid":"J.3","hostip":"10.0.1.2","taskid":"T.3.001","group":"foo","executer":{"name":"exec","param":{"runtime":"3600","exec":"sleep 300"}},"resources":[["CPU","0","1"],["MEM","0","1"],["GPU","2","1"]]}]
{"name":"time","data":[5]}

File last revised: 2020-07-26 23:17:56
